Course 7: Regression Models

Introduction

In statistics, regression analysis is a statistical process for estimating the relationships between variables. It includes many techniques for modeling and analyzing several variables when the focus is on the relationship between a dependent variable (also called the outcome or response variable) and one or more independent variables (also called covariates or regressors).

This course covers regression analysis, least squares, and inference using regression models. The special cases of regression models, ANOVA and ANCOVA, are also covered, and the analysis of residuals and variability is investigated. Regression models produce highly interpretable models, unlike machine learning algorithms, which often sacrifice interpretability for better predictive performance. Prediction is, of course, the ultimate goal of any estimation, but for their simplicity and interpretability we should always try regression models first.

We can classify regression types according to several criteria.

First, by the number of independent variables:

Simple regression: the variable Y depends on a single variable X.

Multiple regression: the variable Y depends on several variables (X1, X2, ..., Xr).

Second, by the type of function f(X):

Linear regression: f(X) is a linear function.

Nonlinear regression: f(X) is not a linear function.

Third, by the nature of the relationship between the two variables:

The variable X may be the cause of the value of the variable Y.

For example, in toxicology, if X = drug dose and Y = mortality, mortality is attributed to the administered dose and not to other causes.

There may simply be an association between the two variables.

For example, in a medical study of the variables X = weight and Y = height in a group of individuals, the two may be related, although one can hardly be considered the cause of the other.

The concept of linear regression and least squares

Linear models relate an outcome to a set of predictors of interest using linear assumptions. They are a subset of regression models, which are the most important statistical analysis tool for a data scientist. A linear regression model therefore always has the following form.

Y = A + BX + u

Y = dependent (endogenous) variable

X = independent (explanatory) variable

A, B = fixed, unknown parameters

Notation

  • Regular letters (e.g. \(X\), \(Y\)) represent observed variables
  • Greek letters (e.g. \(\mu\), \(\sigma\)) represent unknown quantities we want to estimate, such as the population mean or variance
  • \(X_1, X_2, \ldots, X_n\) describes \(n\) data points
  • \(\bar X\), \(\bar Y\) = observed means of the random variables \(X\) and \(Y\)
  • \(\hat \beta_0\), \(\hat \beta_1\) = estimates of the true values \(\beta_0\) and \(\beta_1\)

Sample mean

  • The sample mean is defined as \[\bar X = \frac{1}{n}\sum_{i=1}^n X_i\]
  • Centering a random variable means subtracting the sample mean from all its values, so that afterwards the distribution is centered at the origin \[\tilde X_i = X_i - \bar X\]
    • the mean of \(\tilde X_i\) is 0
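As a quick numeric check, here is a minimal Python sketch of centering (the course code is in R; the data here are hypothetical):

```python
# Hypothetical observations
xs = [2.0, 4.0, 6.0, 8.0]
# Sample mean
xbar = sum(xs) / len(xs)
# Centered variable: subtract the sample mean from every value
centered = [x - xbar for x in xs]
# The mean of the centered values is 0 by construction
mean_centered = sum(centered) / len(centered)
```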

Standard deviation and variance

  • The variance is defined as \[ S^2 = \frac{1}{n-1} \sum_{i=1}^n (X_i - \bar X)^2 = \frac{1}{n-1} \left( \sum_{i=1}^n X_i^2 - n \bar X ^ 2 \right) \Leftarrow \mbox{shortcut for calculation}\]
  • The standard deviation is defined as \(S = \sqrt{S^2}\)
    • Having taken the square root of the variance, the standard deviation is in the same units as the data
  • Scaling a random variable means dividing all its values by the standard deviation, \(X_i / S\)
    • the standard deviation of \(X_i / S\) is 1

Normalization

  • Normalizing a random variable means centering and scaling it \[Z_i = \frac{X_i - \bar X}{s}\]
    • empirical mean = 0, empirical standard deviation = 1
    • The distribution is always centered at 0, and each data point is expressed as the number of standard deviations from the original mean
      • Example: \(Z_1 = 2\) means that point 1 lies 2 standard deviations above the original mean

NORMALIZATION MAKES DATA COMPARABLE EVEN WHEN MEASURED IN DIFFERENT UNITS
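A minimal Python sketch of normalization, using hypothetical data (the course itself works in R):

```python
from statistics import mean, stdev

# Hypothetical observations in arbitrary units
xs = [10.0, 12.0, 14.0, 18.0]
xbar, s = mean(xs), stdev(xs)       # stdev uses the n - 1 denominator
# Normalize: center, then scale by the standard deviation
zs = [(x - xbar) / s for x in xs]
# Empirical mean 0 and standard deviation 1 by construction
z_mean, z_sd = mean(zs), stdev(zs)
```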

Covariance and correlation

  • Let \((X_i, Y_i)\) be paired data

  • The covariance is defined as \[ Cov(X, Y) = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar X) (Y_i - \bar Y) = \frac{1}{n-1}\left( \sum_{i=1}^n X_i Y_i - n \bar X \bar Y\right) \]
    • its units are units of \(X \times\) units of \(Y\)
  • The correlation is defined as \[Cor(X, Y) = \frac{Cov(X, Y)}{S_x S_y}\] where \(S_x\) and \(S_y\) are the estimated standard deviations of the \(X\) and \(Y\) observations, respectively.

    • The value is effectively the covariance standardized into a unitless quantity

    • \(Cor(X, Y) = Cor(Y, X)\)

    • \(-1 \leq Cor(X, Y) \leq 1\)

    • \(Cor(X,Y) = 1\) and \(Cor(X, Y) = -1\) only when the \(X\) and \(Y\) observations fall perfectly on a line of positive or negative slope, respectively

    • \(Cor(X, Y)\) measures the strength of the linear relationship between \(X\) and \(Y\), the relationship growing stronger as \(Cor(X,Y)\) approaches -1 or 1

    • \(Cor(X, Y) = 0\) implies no linear relationship.
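These properties can be verified numerically; a small Python sketch with hypothetical paired data (symmetry, bounds, and the perfect-line case):

```python
from statistics import mean, stdev

def cov(xs, ys):
    # Sample covariance with the n - 1 denominator
    xbar, ybar = mean(xs), mean(ys)
    return sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / (len(xs) - 1)

def cor(xs, ys):
    # Correlation: covariance standardized into a unitless quantity
    return cov(xs, ys) / (stdev(xs) * stdev(ys))

# Hypothetical paired data
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 8.0]
r = cor(xs, ys)
r_sym = cor(ys, xs)                        # symmetry: Cor(X, Y) = Cor(Y, X)
r_line = cor(xs, [2 * x + 1 for x in xs])  # points exactly on a line -> 1
```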


Galton tried to infer children's heights from their parents' heights.


{r galton,fig.height=3.5,fig.width=8}
library(UsingR); library(ggplot2); library(reshape); data(galton); long <- melt(galton)
g <- ggplot(long, aes(x = value, fill = variable)) 
g <- g + geom_histogram(colour = "black", binwidth=1) 
g <- g + facet_grid(. ~ variable)
g

Before seeing whether we can infer children's heights from their parents' heights with the method of least squares, note that the least-squares criterion here is the sum, over all observations, of the squared difference between each value and a chosen center; the mean is the center that minimizes it.


# load necessary packages/install if needed
library(ggplot2); library(UsingR); data(galton)
# function to plot the histograms
myHist <- function(mu){
    # calculate the mean squares
    mse <- mean((galton$child - mu)^2)
    # plot histogram
    g <- ggplot(galton, aes(x = child)) + geom_histogram(fill = "salmon",
        colour = "black", binwidth=1)
    # add vertical line marking the center value mu
    g <- g + geom_vline(xintercept = mu, size = 2)
    g <- g + ggtitle(paste("mu = ", mu, ", MSE = ", round(mse, 2), sep = ""))
    g
}
# manipulate allows the user to change the variable mu to see how the mean squared error changes
#   library(manipulate); manipulate(myHist(mu), mu = slider(62, 74, step = 0.5))
# plot the correct graph
myHist(mean(galton$child))

As we can see, the mean is (logically) the value that minimizes the mean squared error; if we choose any other center value, the MSE will no longer be at its minimum.
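The same fact can be checked outside R; a minimal Python sketch with hypothetical heights:

```python
# Hypothetical child heights (inches)
heights = [66.0, 67.5, 68.1, 69.0, 70.2]

def mse(mu):
    # Mean squared error of guessing a single center mu for every observation
    return sum((h - mu) ** 2 for h in heights) / len(heights)

# The sample mean is the center that minimizes the MSE
mu_star = sum(heights) / len(heights)
```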

Once we know the MSE, let's show a plot of 928 points pairing the parents' and children's heights. The women's heights have been multiplied by 1.08 relative to the men's so that they can be averaged together. In the plot we use the jitter function so that coincident values do not overlap too much.

{r,dependson="galton",fig.height=4,fig.width=4, fig.align='center'}
ggplot(galton, aes(x = parent, y = jitter(child))) + geom_point(shape=1)

Now we will add a blue line to the plot: the line around which the variation of the data is smallest. If the slope of the line is greater than zero, the parents' heights affect the children's heights.

The slope of the line is the estimated coefficient, the multiplier of parent, the independent variable in our data. With summary we can see the slope of the regression line:

regrline <- lm(child ~ parent,galton)

summary(regrline)

Call:
lm(formula = child ~ parent, data = galton)

Residuals:
    Min      1Q  Median      3Q     Max 
-7.8050 -1.3661  0.0487  1.6339  5.9264 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 23.94153    2.81088   8.517   <2e-16 ***
parent       0.64629    0.04114  15.711   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.239 on 926 degrees of freedom
Multiple R-squared:  0.2105, Adjusted R-squared:  0.2096 
F-statistic: 246.8 on 1 and 926 DF,  p-value: < 2.2e-16

Let us now add two blue lines marking the mean heights of the parents and of the children. Note that both lines cross at a single point, exactly the midpoint of the regression line.

The slope of the red line shows how much the vertical direction changes per unit change in the horizontal direction: parents who are 1 inch above the mean tend to have children who are only about 0.65 inches above the mean; the green triangle marks this point.


The concept of residuals

  • Residual, \(e_i\) = difference between the observed outcome and the estimated one \[e_i = Y_i - \hat Y_i\]
    • Equivalently, the vertical distance between the observed data point and the regression line
    • Least squares minimizes \(\sum_{i=1}^n e_i^2\)
  • \(e_i\) can be interpreted as an estimate of the regression error, \(\epsilon_i\)
  • \(e_i\) can also be interpreted as the outcome (\(Y\)) with the linear association of the predictor (\(X\)) removed
    • or, “Y adjusted for X”
  • \(E[e_i] = 0\) \(\rightarrow\) the mean of the residuals is expected to be 0 (a Gaussian distribution is assumed)
    • This Gaussian assumption implies that the error is NOT correlated with any predictor
    • mean(fitModel$residuals) = returns the mean of the residuals \(\rightarrow\) should equal 0
    • cov(fit$residuals, predictors) = returns the covariance (a measure of correlation) between residuals and predictors \(\rightarrow\) should equal 0
  • \(\sum_{i=1}^n e_i = 0\) (if an intercept is included) and \(\sum_{i=1}^n e_i X_i = 0\) (if a regressor \(X_i\) is included)
  • for standard linear regression models
    • positive residuals = above the line
    • negative residuals = below the line
  • Plotting the residuals can highlight a poor model fit
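The two zero-sum properties above can be checked numerically; a minimal Python sketch of a least-squares fit on hypothetical data:

```python
from statistics import mean

# Hypothetical paired data
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 2.5, 4.1, 4.9, 6.2]
xbar, ybar = mean(xs), mean(ys)

# Least-squares slope and intercept
beta1 = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / \
        sum((x - xbar) ** 2 for x in xs)
beta0 = ybar - beta1 * xbar

# Residuals: observed minus fitted
resid = [y - (beta0 + beta1 * x) for x, y in zip(xs, ys)]
sum_e = sum(resid)                              # ~0: intercept included
sum_ex = sum(e * x for e, x in zip(resid, xs))  # ~0: x is a regressor
```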

Below are some examples of computing the residuals, the distances between the children's actual heights and those estimated by the regression line. Like every line, it is characterized by two parameters, a slope and an intercept; least squares provides two equations in two unknowns, so we can solve for these two parameters.

The first equation says that the “errors” in our estimates, the residuals, have mean zero. In other words, the residuals are “balanced” among the data points; they’re just as likely to be positive as negative. The second equation says that our residuals must be uncorrelated with our predictors, the parents’ height. This makes sense - if the residuals and predictors were correlated then you could make a better prediction and reduce the distances (residuals) between the actual outcomes and the predictions.

We’ll demonstrate these concepts now. First regenerate the regression line and call it fit. Use the R function lm. Recall that by default its first argument is a formula such as “child ~ parent” and its second is the dataset, in this case galton.

lm(child ~ parent,galton)

Call:
lm(formula = child ~ parent, data = galton)

Coefficients:
(Intercept)       parent  
    23.9415       0.6463  

fit <- lm(child ~ parent, galton)

Now we’ll examine fit to see its slope and intercept. The residuals we’re interested in are stored in the 928-long vector fit$residuals. If you type fit$residuals you’ll see a lot of numbers scroll by which isn’t very useful; however if you type “summary(fit)” you will see a more concise display of the regression data. Do this now.

summary(fit)

Call:
lm(formula = child ~ parent, data = galton)

Residuals:
    Min      1Q  Median      3Q     Max 
-7.8050 -1.3661  0.0487  1.6339  5.9264 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 23.94153    2.81088   8.517   <2e-16 ***
parent       0.64629    0.04114  15.711   <2e-16 ***

We’ll verify these claims now. We’ve defined for you two R functions, est and sqe. Both take two inputs, a slope and an intercept. The function est calculates a child’s height (y-coordinate) using the line defined by the two parameters, (slope and intercept), and the parents’ heights in the Galton data as x-coordinates.

Let “mch” represent the mean of the galton childrens’ heights and “mph” the mean of the galton parents’ heights. Let “ic” and “slope” represent the intercept and slope of the regression line respectively. As shown in the slides and past lessons, the point (mph,mch) lies on the regression line. This means

1: I haven’t the slightest idea. 2: mch = ic + slope*mph 3: mph = ic + slope*mch

Selection: 2

A line is the set of all points (x,y) satisfying the equation y = mx + b, where m is the slope of the line and b is its intercept. Since the point (mph,mch) lies on the regression line with intercept ic and slope “slope”, mch = ic + slope*mph.

The function sqe calculates the sum of the squared residuals, the differences between the actual children’s heights and the estimated heights specified by the line defined by the given parameters (slope and intercept). R provides the function deviance to do exactly this using a fitted model (e.g., fit) as its argument. However, we provide sqe because we’ll use it to test regression lines different from fit.

We’ll see that when we vary or tweak the slope and intercept values of the regression line which are stored in fit$coef, the resulting squared residuals are approximately equal to the sum of two sums of squares - that of the original regression residuals and that of the tweaks themselves. More precisely, up to numerical error,

sqe(ols.slope+sl,ols.intercept+ic) == deviance(fit) + sum(est(sl,ic)^2 )

Equivalently, sqe(ols.slope+sl,ols.intercept+ic) == sqe(ols.slope, ols.intercept) + sum(est(sl,ic)^2 )

The left side of the equation represents the squared residuals of a new line, the “tweaked” regression line. The terms “sl” and “ic” represent the variations in the slope and intercept respectively. The right side has two terms. The first represents the squared residuals of the original regression line and the second is the sum of squares of the variations themselves.
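This identity holds because the original residuals are orthogonal to both the constant and the predictor, so the cross terms vanish. A minimal Python sketch with hypothetical data (est and sqe here are stand-ins for the swirl-provided functions):

```python
from statistics import mean

xs = [1.0, 2.0, 3.0, 4.0, 5.0]      # hypothetical x data
ys = [2.0, 2.5, 4.1, 4.9, 6.2]      # hypothetical y data
xbar, ybar = mean(xs), mean(ys)
ols_slope = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / \
            sum((x - xbar) ** 2 for x in xs)
ols_ic = ybar - ols_slope * xbar

def est(slope, intercept):
    # y-coordinates of the line at the data's x-coordinates
    return [intercept + slope * x for x in xs]

def sqe(slope, intercept):
    # Sum of squared residuals for the given line
    return sum((y - yhat) ** 2 for y, yhat in zip(ys, est(slope, intercept)))

sl, ic = 0.3, -0.5                  # arbitrary tweaks
lhs = sqe(ols_slope + sl, ols_ic + ic)
rhs = sqe(ols_slope, ols_ic) + sum(e ** 2 for e in est(sl, ic))
```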

We’ll demonstrate this now. First extract the intercept from fit$coef and put it in a variable called ols.ic. The intercept is the first element of the fit$coef vector, that is fit$coef[1].

ols.ic <- fit$coef[1]

Now extract the slope from fit$coef and put it in the variable ols.slope; the slope is the second element of the fit$coef vector, fit$coef[2].

ols.slope <- fit$coef[2]

Now we’ll show you some R code which generates the left and right sides of this equation. Take a moment to look it over. We’ve formed two 6-long vectors of variations, one for the slope and one for the intercept. Then we have two “for” loops to generate the two sides of the equation.

Subtract the right side, the vector rhs, from the left, the vector lhs, to see the relationship between them. You should get a vector of very small, almost 0, numbers.


lhs - rhs
[1]  1.264198e-09  2.527486e-09  3.801688e-09 -1.261469e-09 -2.522938e-09 -3.767127e-09

You could also use the R function all.equal with lhs and rhs as arguments to test for equality. Try it now.

all.equal(lhs, rhs)
[1] TRUE

Now we’ll show that the variance in the children’s heights is the sum of the variance in the OLS estimates and the variance in the OLS residuals. First use the R function var to calculate the variance in the children’s heights and store it in the variable varChild.


varChild <- var(galton$child)

Remember that we’ve calculated the residuals and they’re stored in fit$residuals. Use the R function var to calculate the variance in these residuals now and store it in the variable varRes.

varRes <- var(fit$residuals)

Recall that the function “est” calculates the estimates (y-coordinates) of values along the regression line defined by the variables “ols.slope” and “ols.ic”. Compute the variance in the estimates and store it in the variable varEst.


varEst <- var(est(ols.slope, ols.ic))

Now use the function all.equal to compare varChild and the sum of varRes and varEst.


all.equal(varChild, varEst + varRes)
[1] TRUE

Since variances are sums of squares (and hence always positive), this equation which we’ve just demonstrated, var(data)=var(estimate)+var(residuals), shows that the variance of the estimate is ALWAYS less than the variance of the data.
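The variance decomposition var(data) = var(estimate) + var(residuals) can be verified numerically; a minimal Python sketch on hypothetical data:

```python
from statistics import mean, variance

xs = [1.0, 2.0, 3.0, 4.0, 5.0]      # hypothetical predictor
ys = [2.0, 2.5, 4.1, 4.9, 6.2]      # hypothetical outcome
xbar, ybar = mean(xs), mean(ys)
beta1 = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / \
        sum((x - xbar) ** 2 for x in xs)
beta0 = ybar - beta1 * xbar

fitted = [beta0 + beta1 * x for x in xs]
resid = [y - f for y, f in zip(ys, fitted)]
# var(data) = var(estimate) + var(residuals), since fitted and residuals
# are uncorrelated for a least-squares fit
total, explained, unexplained = variance(ys), variance(fitted), variance(resid)
```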

Since var(data)=var(estimate)+var(residuals) and variances are always positive, the variance of residuals

1: is less than the variance of data 2: is unknown without actual data 3: is greater than the variance of data

Selection: 1

The two properties of the residuals we’ve emphasized here can be applied to datasets which have multiple predictors. In this lesson we’ve loaded the dataset attenu which gives data for 23 earthquakes in California. Accelerations are estimated based on two predictors, distance and magnitude.

Generate the regression line for this data. Type efit <- lm(accel ~ mag+dist, attenu) at the R prompt.

efit <- lm(accel ~ mag+dist, attenu)

Verify the mean of the residuals is 0.

mean(efit$residuals)
[1] -1.785061e-18

Using the R function cov verify the residuals are uncorrelated with the magnitude predictor, attenu$mag.


cov(efit$residuals, attenu$mag)
[1] 5.338694e-17

Using the R function cov verify the residuals are uncorrelated with the distance predictor, attenu$dist.

cov(efit$residuals, attenu$dist)
[1] 5.253433e-16

The concept of least squares estimation


As we can see, the regression line summarizes the relationship between the parents' heights (predictor) and the children's heights (outcome).

Exercises

We learned in the last lesson that the regression line is the line through the data which has the minimum (least) squared “error”, the vertical distance between the 928 actual children’s heights and the heights predicted by the line. Squaring the distances ensures that data points above and below the line are treated the same. This method of choosing the ‘best’ regression line (or ‘fitting’ a line to the data) is known as ordinary least squares.
As shown in the slides, the slope of the regression line is the correlation between the two sets of heights multiplied by the ratio of the standard deviations (childrens’ to parents’ or outcomes to predictors).
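That slope identity, beta1 = Cor(X, Y) * sd(Y)/sd(X), can be checked numerically; a minimal Python sketch with hypothetical data:

```python
from statistics import mean, stdev

xs = [1.0, 2.0, 3.0, 4.0, 5.0]      # hypothetical predictor (parents)
ys = [2.0, 2.5, 4.1, 4.9, 6.2]      # hypothetical outcome (children)
xbar, ybar = mean(xs), mean(ys)

# Least-squares slope, computed directly ...
beta1 = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / \
        sum((x - xbar) ** 2 for x in xs)

# ... and as the correlation times the ratio of standard deviations
r = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / (len(xs) - 1)) / \
    (stdev(xs) * stdev(ys))
beta1_from_cor = r * stdev(ys) / stdev(xs)
```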

Here we show code which demonstrates how changing the slope of the regression line affects the mean squared error between actual and predicted values. Look it over to see how straightforward it is.

Now you can actually play with the code to use R’s manipulate function and find the minimum squared error. You can adjust the slider with the left mouse button or use the right and left arrow keys to see how changing the slope (beta) affects the mean squared error (mse). If the slider disappears you can call it back by clicking on the little gear in the upper left corner of the plot window.

Which value of the slope minimizes the mean squared error?

1: .64 2: .44 3: 5 4: .70

Selection: 1

What was the minimum mse?

1: .64 2: .66 3: 5.0 4: 44

Selection: 3

Recall that you normalize data by subtracting its mean and dividing by its standard deviation. We’ve done this for the galton child and parent data for you. We’ve stored these normalized values in two vectors, gpa_nor and gch_nor, the normalized galton parent and child data.

Use R’s function “cor” to compute the correlation between these normalized data sets.

cor(gch_nor, gpa_nor)
[1] 0.4587624

How does this correlation relate to the correlation of the unnormalized data?

1: It is the same. 2: It is bigger. 3: It is smaller.

Selection: 1

Use R’s function “lm” to generate the regression line using this normalized data. Store it in a variable called l_nor. Use the parents’ heights as the predictors (independent variable) and the childrens’ as the predicted (dependent). Remember, ‘lm’ needs a formula of the form dependent ~ independent. Since we’ve created the data vectors for you there’s no need to provide a second “data” argument as you have previously.

l_nor <- lm(gch_nor~gpa_nor)

What is the slope of this line?

1: I have no idea 2: 1. 3: The correlation of the 2 data sets

Selection: 3

If you swapped the outcome (Y) and predictor (X) of your original (unnormalized) data, (for example, used childrens’ heights to predict their parents), what would the slope of the new regression line be?

1: the same as the original 2: I have no idea 3: 1. 4: correlation(X,Y) * sd(X)/sd(Y)

Selection: 4

We’ll close with a final display of source code from the slides. It plots the galton data with three regression lines, the original in red with the children as the outcome, a new blue line with the parents’ as outcome and childrens’ as predictor, and a black line with the slope scaled so it equals the ratio of the standard deviations.

TEST 1


Applying linear regression and least squares

So far we have only considered estimation, which is useful, but we also need to know how to extend our estimates to the population. This is the process of statistical inference. Our approach to statistical inference will be through statistical models. At a minimum, we need some assumptions on the distribution of the errors; we will focus on models assuming Gaussianity.

Let's see how, using least squares, to recompute the equation after modifying the intercept (the crossing with the y-axis) and the scale of the slope (for example because its value is very large).

We use the example of diamond prices as a function of their weight in carats.

Plot of the data

The intercept is the value of the line at carat = 0, and the carat coefficient is the slope of the line: each 1-carat increase adds $3721.

fit <- lm(price ~ carat, data = diamond)
coef(fit)
## (Intercept)       carat 
##   -259.6259   3721.0249

We have no interest in the price of a 0-carat diamond (-259.6), so we can instead compute the intercept at the mean carat weight. To do arithmetic inside a formula, the expression must be wrapped in I( ).

Getting a more interpretable intercept

fit2 <- lm(price ~ I(carat - mean(carat)), data = diamond)
coef(fit2)
##            (Intercept) I(carat - mean(carat)) 
##               500.0833              3721.0249

We now see that a 1-carat change is too large a step, so we may want to rescale the predictor to get a better sense of the value; with I(carat * 10), the coefficient becomes the price change per tenth of a carat.

fit3 <- lm(price ~ I(carat * 10), data = diamond)
coef(fit3)
##   (Intercept) I(carat * 10) 
##     -259.6259      372.1025
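The effect of these two reparameterizations can be checked numerically; a minimal Python sketch with hypothetical carat/price data (the notes' own analysis uses R and the diamond dataset):

```python
from statistics import mean

def ols(xs, ys):
    # Least-squares intercept and slope
    xbar, ybar = mean(xs), mean(ys)
    b1 = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / \
         sum((x - xbar) ** 2 for x in xs)
    return ybar - b1 * xbar, b1

carat = [0.20, 0.25, 0.30, 0.35]        # hypothetical carat weights
price = [500.0, 680.0, 870.0, 1050.0]   # hypothetical prices

b0, b1 = ols(carat, price)
# Centering the predictor: same slope, intercept becomes the mean price
b0_c, b1_c = ols([c - mean(carat) for c in carat], price)
# Rescaling the predictor by 10: slope divided by 10, intercept unchanged
b0_s, b1_s = ols([10 * c for c in carat], price)
```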

Our ultimate goal is to be able to estimate the value of new diamonds; in this case we want to infer the price of 3 diamonds.

Predicting the price of a diamond

newx <- c(0.16, 0.27, 0.34)
coef(fit)[1] + coef(fit)[2] * newx
## [1]  335.7381  745.0508 1005.5225
predict(fit, newdata = data.frame(carat = newx))
##         1         2         3 
##  335.7381  745.0508 1005.5225

If we call predict(fit) without new data, it returns the predicted values only at the x values in the dataset.

We can display the predicted values on the same plot.

Exercises

In this lesson we’ll illustrate that regression in many variables amounts to a series of regressions in one. Using regression in one variable, we’ll show how to eliminate any chosen regressor, thus reducing a regression in N variables, to a regression in N-1. Hence, if we know how to do a regression in 1 variable, we can do a regression in 2. Once we know how to do a regression in 2 variables, we can do a regression in 3, and so on. We begin with the galton data and a review of eliminating the intercept by subtracting the means.
When we perform a regression in one variable, such as lm(child ~ parent, galton), we get two coefficients, a slope and an intercept. The intercept is really the coefficient of a special regressor which has the same value, 1, at every sample. The function, lm, includes this regressor by default.
We’ll demonstrate by substituting an all-ones regressor of our own. This regressor must have the same number of samples as galton (928.) Create such an object and name it ones, using ones <- rep(1, nrow(galton)), or some equivalent expression.

ones <- rep(1, nrow(galton))

The galton data has already been loaded. The default intercept can be excluded by using -1 in the formula. Perform a regression which substitutes our regressor, ones, for the default using lm(child ~ ones + parent -1, galton). Since we want the result to print, don’t assign it to a variable.

lm(child ~ ones + parent -1, galton)

Call:
lm(formula = child ~ ones + parent - 1, data = galton)

Coefficients:
   ones  parent  
23.9415  0.6463  

The coefficient of ones is 23.9415. Now use the default, lm(child ~ parent, galton), to show the intercept has the same value. This time, DO NOT suppress the intercept with -1.

lm(child ~ parent, galton)

Call: lm(formula = child ~ parent, data = galton)

Coefficients: (Intercept) parent
23.9415 0.6463

Excellent job!
The regression in one variable given by lm(child ~ parent, galton) really involves two regressors, the variable, parent, and a regressor of all ones.

1: True 2: False

Selection: 1

All that practice is paying off!
In earlier lessons we demonstrated that the regression line given by lm(child ~ parent, galton) goes through the point x=mean(parent), y=mean(child). We also showed that if we subtract the mean from each variable, the regression line goes through the origin, x=0, y=0, hence its intercept is zero. Thus, by subtracting the means, we eliminate one of the two regressors, the constant, leaving just one, parent. The coefficient of the remaining regressor is the slope.
qownnotes-media-bLHzWL

Subtracting the means to eliminate the intercept is a special case of a general technique which is sometimes called Gaussian Elimination. As it applies here, the general technique is to pick one regressor and to replace all other variables by the residuals of their regressions against that one.

Suppose, as claimed, that subtracting a variable’s mean is a special case of replacing the variable with a residual. In this special case, it would be the residual of a regression against what?

1: The variable itself 2: The outcome 3: The constant, 1

Selection: 3

That’s the answer I was looking for.

The mean of a variable is the coefficient of its regression against the constant, 1. Thus, subtracting the mean is equivalent to replacing a variable by the residual of its regression against 1. In an R formula, the constant regressor can be represented by a 1 on the right hand side. Thus, the expression, lm(child ~ 1, galton), regresses child against the constant, 1. Recall that in the galton data, the mean height of a child was 68.09 inches. Use lm(child ~ 1, galton) to compare the resulting coefficient (the intercept) and the mean height of 68.09. Since we want the result to print, don’t assign it a name.

lm(child ~ 1, galton)

Call: lm(formula = child ~ 1, data = galton)

Coefficients: (Intercept)
68.09

That’s a job well done!
The mean of a variable is equal to its regression against the constant, 1.

1: True 2: False

Selection: 1

All that hard work is paying off!
To illustrate the general case we’ll use the trees data from the datasets package. The idea is to predict the Volume of timber which a tree might produce from measurements of its Height and Girth. To avoid treating the intercept as a special case, we have added a column of 1’s to the data which we shall use in its place. Please take a moment to inspect the data using either View(trees) or head(trees).

head(trees)
  Constant Girth Height Volume
1        1   8.3     70   10.3
2        1   8.6     65   10.3
3        1   8.8     63   10.2
4        1  10.5     72   16.4
5        1  10.7     81   18.8
6        1  10.8     83   19.7

Excellent job!
A file of relevant code has been copied to your working directory and sourced. The file, elimination.R, should have appeared in your editor. If not, please open it manually.

# Regress the given variable on the given predictor,
# suppressing the intercept, and return the residual.
regressOneOnOne <- function(predictor, other, dataframe){
  # Point A. Create a formula such as Girth ~ Height -1
  formula <- paste0(other, " ~ ", predictor, " - 1")
  # Use the formula in a regression and return the residual.
  resid(lm(formula, dataframe))
}

# Eliminate the specified predictor from the dataframe by
# regressing all other variables on that predictor
# and returning a data frame containing the residuals
# of those regressions.
eliminate <- function(predictor, dataframe){
  # Find the names of all columns except the predictor.
  others <- setdiff(names(dataframe), predictor)
  # Calculate the residuals of each when regressed against the given predictor
  temp <- sapply(others, function(other)regressOneOnOne(predictor, other, dataframe))
  # sapply returns a matrix of residuals; convert to a data frame and return.
  as.data.frame(temp)
}
The general technique is to pick one predictor and to replace all other variables by the residuals of their regressions against that one. The function, regressOneOnOne, in eliminate.R performs the first step of this process. Given the name of a predictor and one other variable, other, it returns the residual of other when regressed against predictor. In its first line, labeled Point A, it creates a formula. Suppose that predictor were ‘Girth’ and other were ‘Volume’. What formula would result?

1: Volume ~ Girth 2: Girth ~ Volume - 1 3: Volume ~ Girth - 1

Selection: 3

That’s correct!
The remaining function, eliminate, applies regressOneOnOne to all variables except a given predictor and collects the residuals in a data frame. We’ll first show that when we eliminate one regressor from the data, a regression on the remaining will produce their correct coefficients. (Of course, the coefficient of the eliminated regressor will be missing, but more about that later.)
For reference, create a model named fit, based on all three regressors, Girth, Height, and Constant, and assign the result to a variable named fit. Use an expression such as fit <- lm(Volume ~ Girth + Height + Constant -1, trees). Don’t forget the -1, and be sure to name the model fit for later use.

fit <- lm(Volume ~ Girth + Height + Constant -1, trees)

Great job!
Now let’s eliminate Girth from the data set. Call the reduced data set trees2 to indicate it has only 2 regressors. Use the expression trees2 <- eliminate("Girth", trees).

trees2 <- eliminate("Girth", trees)

Keep working like that and you’ll get there!
Use head(trees2) or View(trees2) to inspect the reduced data set.

head(trees2)
   Constant   Height     Volume
1 0.4057735 24.38809  -9.793826
2 0.3842954 17.73947 -10.520109
3 0.3699767 14.64038 -11.104298
4 0.2482677 14.29818  -9.019900
5 0.2339490 22.19910  -7.104089
6 0.2267896 23.64956  -6.446183

Your dedication is inspiring!
Why, in trees2, is the Constant column not constant?

1: Computational precision was insufficient. 2: The constant, 1, has been replaced by its residual when regressed against Girth. 3: There must be some mistake

Selection: View(trees2)
Enter an item from the menu, or 0 to exit
Selection: 2

Keep working like that and you’ll get there!
Now create a model, called fit2, using the reduced data set. Use an expression such as fit2 <- lm(Volume ~ Height + Constant -1, trees2). Don’t forget to use -1 in the formula.

fit2 <- lm(Volume ~ Height + Constant-1, trees2)

Excellent work!
Use the expression lapply(list(fit, fit2), coef) to print coefficients of fit and fit2 for comparison.

lapply(list(fit, fit2), coef)
[[1]]
      Girth      Height    Constant 
  4.7081605   0.3392512 -57.9876589 

[[2]]
     Height    Constant 
  0.3392512 -57.9876589 

You are amazing!
The coefficient of the eliminated variable is missing, of course. One way to get it would be to go back to the original data, trees, eliminate a different regressor, such as Height, and do another 2 variable regession, as above. There are much more efficient ways, but efficiency is not the point of this demonstration. We have shown how to reduce a regression in 3 variables to a regression in 2. We can go further and eliminate another variable, reducing a regression in 2 variables to a regression in 1.
Here is the final step. We have used eliminate("Height", trees2) to reduce the data to the outcome, Volume, and the Constant regressor. We have regressed Volume on Constant, and printed the coefficient, as shown below. As you can see, the coefficient of Constant agrees with previous values.

Call: lm(formula = Volume ~ Constant - 1, data = eliminate("Height", trees2))

Coefficients: Constant
-57.99

Suppose we were given a multivariable regression problem involving an outcome and N regressors, where N > 1. Using only single-variable regression, how can the problem be reduced to a problem with only N-1 regressors?

1: Subtract the mean from the outcome and each regressor. 2: Pick any regressor and replace the outcome and all other regressors by their residuals against the chosen one.

Selection: 2

You are really on a roll!
We have illustrated that regression in many variables amounts to a series of regressions in one. The actual algorithms used by functions such as lm are more efficient, but are computationally equivalent to what we have done. That is, the algorithms use equivalent steps but combine them more efficiently and abstractly. This completes the lesson.

Residual Variance

Residuals represent the variation that our model could not explain. It is important to note the difference between residuals and errors. Errors are the unobservable deviations from the true (unknown) coefficients, while residuals are the observable deviations from the estimated coefficients. In a sense, the residuals are our estimate of the errors.

Residuals are the difference between the estimated y values and the actual values in the dataset. Remember that for linear regression to be applicable there are several assumptions, such as the residuals summing to zero and the variance of the residuals being uniform.

Recall that predict(fit) gives the predicted values at the observed x's instead of their actual values. Here we will see that if we compute the residuals with resid() and also by hand, the maximum difference between the two results is very close to 0.
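For instance (sketched with the built-in trees data; with the diamond data the idea is identical):

```r
fit <- lm(Volume ~ Girth, data = trees)

# Residuals computed by hand: observed y minus predicted y
e_manual <- trees$Volume - predict(fit)

# The maximum difference from resid() is essentially zero (floating point only)
max(abs(resid(fit) - e_manual))
```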

qownnotes-media-pNWqxF

We can also check one of the assumptions needed to apply linear regression: the sum of the residuals is 0, and so is the sum of the residuals multiplied by x.

qownnotes-media-kbAnjm

{r, echo = FALSE, fig.height=5, fig.width=5}
# Define the quantities the chunk uses (x, y, n, fit, yhat)
y <- diamond$price; x <- diamond$carat; n <- length(y)
fit <- lm(y ~ x); yhat <- predict(fit)
plot(x, y,
     xlab = "Mass (carats)",
     ylab = "Price (SIN $)",
     bg = "lightblue",
     col = "black", cex = 2, pch = 21, frame = FALSE)
abline(fit, lwd = 2)
# Draw each residual as a red vertical segment from the point to the line
for (i in 1 : n)
  lines(c(x[i], x[i]), c(y[i], yhat[i]), col = "red", lwd = 2)
qownnotes-media-DuQGjF

Residuals often let us see patterns in the data, so it is very useful to plot them against the x axis

{r, echo = FALSE, fig.height=5, fig.width=5}
# Define the quantities the chunk uses; e holds the residuals
x <- diamond$carat; y <- diamond$price
fit <- lm(y ~ x); e <- resid(fit); n <- length(y)
plot(x, e,
     xlab = "Mass (carats)",
     ylab = "Residuals (SIN $)",
     bg = "lightblue",
     col = "black", cex = 2, pch = 21, frame = FALSE)
abline(h = 0, lwd = 2)
# Red segments now join each residual to the zero line
for (i in 1 : n)
  lines(c(x[i], x[i]), c(e[i], 0), col = "red", lwd = 2)

That is not the case here, since our model fits very well, but for example the plot has let us see that there are many cases in which the same size does not always have the same price

qownnotes-media-QLKGdA

Sometimes, even when the data are not linear, drawing the regression line is very useful for detecting a pattern, even though the line may not be good for making an adequate prediction (the model doesn't fit)

Non-linear data (we put sin() into the function for y and also add some normally distributed noise; for the x's, a uniform distribution via runif)

{r, echo = FALSE, fig.height=5, fig.width=5}
x = runif(100, -3, 3); y = x + sin(x) + rnorm(100, sd = .2); 
library(ggplot2)
g = ggplot(data.frame(x = x, y = y), aes(x = x, y = y))
g = g + geom_smooth(method = "lm", colour = "black")
g = g + geom_point(size = 7, colour = "black", alpha = 0.4)
g = g + geom_point(size = 5, colour = "red", alpha = 0.4)
g
qownnotes-media-NprLkt

At first glance this might look like a good model, but if we take a close look at its residuals:

{r, echo = FALSE, fig.height=5, fig.width=5}
g = ggplot(data.frame(x = x, y = resid(lm(y ~ x))), 
           aes(x = x, y = y))
g = g + geom_hline(yintercept = 0, size = 2); 
g = g + geom_point(size = 7, colour = "black", alpha = 0.4)
g = g + geom_point(size = 5, colour = "red", alpha = 0.4)
g = g + xlab("X") + ylab("Residual")
g

We see that there really is a lot of variability around the regression line

qownnotes-media-SuXmwZ

And we even see the pattern much more clearly in the data

qownnotes-media-KBfCUc

Another thing we can check is whether the second assumption holds: that the variance of the residuals is uniform. We will examine this by looking for heteroscedasticity.

qownnotes-media-uMHevD

Here the line apparently looks very accurate

{r, echo = FALSE, fig.height=4.5, fig.width=4.5}
x <- runif(100, 0, 6); y <- x + rnorm(100,  mean = 0, sd = .001 * x); 
g = ggplot(data.frame(x = x, y = y), aes(x = x, y = y))
g = g + geom_smooth(method = "lm", colour = "black")
g = g + geom_point(size = 7, colour = "black", alpha = 0.4)
g = g + geom_point(size = 5, colour = "red", alpha = 0.4)
g
qownnotes-media-oYqzkP

However, if we take a deeper look at the residuals, we see that their distribution is not uniform but grows with x. This is what is called heteroscedasticity, and we produced it by putting x inside the rnorm call for y

y <- x + rnorm(100, mean = 0, sd = .001 * x)

{r, echo = FALSE, fig.height=4.5, fig.width=4.5}
g = ggplot(data.frame(x = x, y = resid(lm(y ~ x))), 
           aes(x = x, y = y))
g = g + geom_hline(yintercept = 0, size = 2); 
g = g + geom_point(size = 7, colour = "black", alpha = 0.4)
g = g + geom_point(size = 5, colour = "red", alpha = 0.4)
g = g + xlab("X") + ylab("Residual")
g
qownnotes-media-NyOltz

Going back to the diamond example, let's add the residual of each pair of values to the dataset and plot the residuals against carats. Residuals always have the units of y.

{r, echo = FALSE, fig.height=4.5, fig.width=4.5}
diamond$e <- resid(lm(price ~ carat, data = diamond))
g = ggplot(diamond, aes(x = carat, y = e))
g = g + xlab("Mass (carats)")
g = g + ylab("Residual price (SIN $)")
g = g + geom_hline(yintercept = 0, size = 2)
g = g + geom_point(size = 7, colour = "black", alpha=0.5)
g = g + geom_point(size = 5, colour = "blue", alpha=0.2)
g
qownnotes-media-ByaBpa

First we compute the residuals against the mean price (that is, against the intercept alone); the other vector is the residuals against the regression line

Then we create two factors, one for intercept only and one for intercept and slope

{r, echo = FALSE, fig.height=4.5, fig.width=4.5}
e = c(resid(lm(price ~ 1, data = diamond)),
      resid(lm(price ~ carat, data = diamond)))
fit = factor(c(rep("Itc", nrow(diamond)),
               rep("Itc, slope", nrow(diamond))))
g = ggplot(data.frame(e = e, fit = fit), aes(y = e, x = fit, fill = fit))
g = g + geom_dotplot(binaxis = "y", size = 2, stackdir = "center", binwidth = 20)
g = g + xlab("Fitting approach")
g = g + ylab("Residual price")
g

The residual variance is therefore the sum of the squared residuals divided by n (in practice, by n - 2)

qownnotes-media-VHpgFq

The reason 2 is subtracted is to remove 2 degrees of freedom: one for estimating the intercept, as with the ordinary variance, and another degree of freedom for the slope relating x and y
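The estimate with n - 2 degrees of freedom matches what summary() reports as sigma (sketched with trees):

```r
fit <- lm(Volume ~ Girth, data = trees)
n <- nrow(trees)

# Residual standard deviation: divide the sum of squared residuals by n - 2
sigma_hat <- sqrt(sum(resid(fit)^2) / (n - 2))
sigma_hat
summary(fit)$sigma   # same value
```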

qownnotes-media-QdNeXf

In the first plot we see the variation in diamond prices around the mean price.

In the blue plot we see the variation around the regression line

qownnotes-media-xBrfKR

qownnotes-media-lbWthX

qownnotes-media-fWNVhO

R^2

It is the percentage of the total variability explained by the linear relationship with the predictor
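Computed by hand, R^2 is one minus the ratio of residual variation to total variation, and it equals the squared correlation (trees again as a stand-in):

```r
fit <- lm(Volume ~ Girth, data = trees)

sTot <- sum((trees$Volume - mean(trees$Volume))^2)  # total variation
sRes <- deviance(fit)                               # residual variation

1 - sRes / sTot                   # R^2 by hand
summary(fit)$r.squared            # same value
cor(trees$Volume, trees$Girth)^2  # also the squared correlation
```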

qownnotes-media-ALluas

In the following examples we will see how, even though everything suggests a good linear relationship, in reality there is none

qownnotes-media-ISIpjA

In the first plot we see there is a lot of noise; in the second, a curve that the line does not fit; in the third, an outlier that interferes heavily; and in the fourth, a single point that degenerates the fit.

qownnotes-media-hvZcHw

Exercises

As shown in the slides, residuals are useful for indicating how well data points fit a statistical model. They “can be thought of as the outcome (Y) with the linear association of the predictor (X) removed. One differentiates residual variation (variation after removing the predictor) from systematic variation (variation explained by the regression model).“
It can also be shown that, given a model, the maximum likelihood estimate of the variance of the random error is the average squared residual. However, since our linear model with one predictor requires two parameters we have only (n-2) degrees of freedom. Therefore, to calculate an “average” squared residual to estimate the variance we use the formula 1/(n-2) * (the sum of the squared residuals). If we divided the sum of the squared residuals by n, instead of n-2, the result would give a biased estimate.
To see this we’ll use our favorite Galton height data. First regenerate the regression line and call it fit. Use the R function lm and recall that by default its first argument is a formula such as “child ~ parent” and its second is the dataset, in this case galton.

lm(child ~ parent,galton)

Call: lm(formula = child ~ parent, data = galton)

Coefficients: (Intercept) parent
23.9415 0.6463

That’s not exactly what I’m looking for. Try again. Or, type info() for more options.
Type “fit <- lm(child ~ parent, galton)” at the R prompt.

fit <- lm(child ~ parent, galton)

That’s the answer I was looking for.
First, we’ll use the residuals (fit$residuals) of our model to estimate the standard deviation (sigma) of the error. We’ve already defined n for you as the number of points in Galton’s dataset (928).
Calculate the sum of the squared residuals divided by the quantity (n-2). Then take the square root.

sqrt(sum(fit$residuals^2) / (n - 2)) [1] 2.238547

Great job!
Now look at the “sigma” portion of the summary of fit, “summary(fit)$sigma”.

summary(fit)$sigma [1] 2.238547

All that hard work is paying off!
Pretty cool, huh?
Another cool thing - take the sqrt of “deviance(fit)/(n-2)” at the R prompt.

sqrt(deviance(fit)/(n-2)) [1] 2.238547

Excellent work!
Another useful fact shown in the slides was
Total Variation = Residual Variation + Regression Variation
Recall the beauty of the slide full of algebra which proved this fact. It had a bunch of Y’s, some with hats and some with bars and several summations of squared values. The Y’s with hats were the estimates provided by the model. (They were on the regression line.) The Y with the bar was the mean or average of the data. Which sum of squared term represented Total Variation?

1: Yi_hat-mean(Yi) 2: Yi-mean(Yi) 3: Yi-Yi_hat

Selection: 2

Nice work!
Which sum of squared term represents Residual Variation?

1: Yi_hat-mean(Yi) 2: Yi-Yi_hat 3: Yi-mean(Yi)

Selection: 2

The term R^2 represents the percent of total variation described by the model, the regression variation (the term we didn’t ask about in the preceding multiple choice questions). Also, since it is a percent we need a ratio or fraction of sums of squares. Let’s do this now for our Galton data.
We’ll start with easy steps. Calculate the mean of the children’s heights and store it in a variable called mu. Recall that we reference the childrens’ heights with the expression ‘galton$child’ and the parents’ heights with the expression ‘galton$parent’.

mu <- mean(galton$child)

That’s the answer I was looking for.
Recall that centering data means subtracting the mean from each data point. Now calculate the sum of the squares of the centered children’s heights and store the result in a variable called sTot. This represents the Total Variation of the data.

sTot <- sum((galton$child-mu)^2)

Excellent job!
Now create the variable sRes. Use the R function deviance to calculate the sum of the squares of the residuals. These are the distances between the children’s heights and the regression line. This represents the Residual Variation.

sRes <- deviance(fit)

Perseverance, that’s the answer.
Finally, the ratio sRes/sTot represents the percent of total variation contributed by the residuals. To find the percent contributed by the model, i.e., the regression variation, subtract the fraction sRes/sTot from 1. This is the value R^2.

1 - sRes/sTot [1] 0.2104629

Your dedication is inspiring!
For fun you can compare your result to the values shown in summary(fit)$r.squared to see if it looks familiar. Do this now.

summary(fit)$r.squared [1] 0.2104629

You are doing so well!
To see some real magic, compute the square of the correlation of the galton data, the children and parents. Use the R function cor.

cor(galton$child, galton$parent)^2 [1] 0.2104629

Keep working like that and you’ll get there!
We’ll now summarize useful facts about R^2. It is the percentage of variation explained by the regression model. As a percentage it is between 0 and 1. It also equals the sample correlation squared. However, R^2 doesn’t tell the whole story.

Inference

It is the process of drawing conclusions about a population from a sample. In statistical inference we must always be conscientiously aware of the uncertainty in our estimates. Hypothesis tests and confidence intervals are the most common forms of statistical inference.

qownnotes-media-SZlQQz

We can compute everything by hand to check that it all agrees

{r}
library(UsingR); data(diamond)
y <- diamond$price; x <- diamond$carat; n <- length(y)
beta1 <- cor(y, x) * sd(y) / sd(x)
beta0 <- mean(y) - beta1 * mean(x)
e <- y - beta0 - beta1 * x
sigma <- sqrt(sum(e^2) / (n-2)) 
ssx <- sum((x - mean(x))^2)
seBeta0 <- (1 / n + mean(x) ^ 2 / ssx) ^ .5 * sigma 
seBeta1 <- sigma / sqrt(ssx)
tBeta0 <- beta0 / seBeta0; tBeta1 <- beta1 / seBeta1
pBeta0 <- 2 * pt(abs(tBeta0), df = n - 2, lower.tail = FALSE)
pBeta1 <- 2 * pt(abs(tBeta1), df = n - 2, lower.tail = FALSE)
coefTable <- rbind(c(beta0, seBeta0, tBeta0, pBeta0), c(beta1, seBeta1, tBeta1, pBeta1))
colnames(coefTable) <- c("Estimate", "Std. Error", "t value", "P(>|t|)")
rownames(coefTable) <- c("(Intercept)", "x")

We compare the results of the hand calculation with the lm calculation

{r}
coefTable
fit <- lm(y ~ x); 
summary(fit)$coefficients
qownnotes-media-PhOFwb

And finally we compute the confidence interval for our estimate

qownnotes-media-bsQtRI

We can therefore say that, with 95% confidence, the price increase for a 0.1 carat increase in a diamond's mass will be between 355.63 and 388.56
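The interval comes from estimate ± t-quantile × standard error. A sketch with the built-in trees data (so the numbers differ from the diamond figures above):

```r
fit <- lm(Volume ~ Girth, data = trees)
sumCoef <- summary(fit)$coefficients

# 95% CI for the slope: estimate +/- t_{0.975, n-2} * SE
ci <- sumCoef[2, 1] + c(-1, 1) * qt(.975, df = fit$df.residual) * sumCoef[2, 2]
ci
confint(fit)["Girth", ]  # confint() gives the same interval
```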

Let's generate predictions for new values of X

{r, fig.height=5, fig.width=5, echo = FALSE, results='hide'}
library(ggplot2)
newx = data.frame(x = seq(min(x), max(x), length = 100))
p1 = data.frame(predict(fit, newdata= newx,interval = ("confidence")))
p2 = data.frame(predict(fit, newdata = newx,interval = ("prediction")))
p1$interval = "confidence"
p2$interval = "prediction"
p1$x = newx$x
p2$x = newx$x
dat = rbind(p1, p2)
names(dat)[1] = "y"

g = ggplot(dat, aes(x = x, y = y))
g = g + geom_ribbon(aes(ymin = lwr, ymax = upr, fill = interval), alpha = 0.2) 
g = g + geom_line()
g = g + geom_point(data = data.frame(x = x, y=y), aes(x = x, y = y), size = 4)
g
qownnotes-media-HxdQrL

In this lesson, we’ll look at some examples of regression models with more than one variable. We’ll begin with the Swiss data which we’ve taken the liberty to load for you. This data is part of R’s datasets package. It was gathered in 1888, a time of demographic change in Switzerland, and measured six quantities in 47 French-speaking provinces of Switzerland. We used the code from the slides (the R function pairs) to display here a 6 by 6 array of scatterplots showing pairwise relationships between the variables. All of the variables, except for fertility, are proportions of population. For example, “Examination” shows the percentage of draftees receiving the highest mark on an army exam, and “Education” the percentage of draftees with education beyond primary school.
qownnotes-media-zSGDCG

First, use the R function lm to generate the linear model “all” in which Fertility is the variable dependent on all the others. Use the R shorthand “.” to represent the five independent variables in the formula passed to lm. Remember the data is “swiss”.

all <- lm(Fertility ~ ., swiss)

Now look at the summary of the linear model all.

summary(all)

Call: lm(formula = Fertility ~ ., data = swiss)

Residuals:
     Min       1Q   Median       3Q      Max 
-15.2743  -5.2617   0.5032   4.1198  15.3213 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)      66.91518   10.70604   6.250 1.91e-07 ***
Agriculture      -0.17211    0.07030  -2.448  0.01873 *  
Examination      -0.25801    0.25388  -1.016  0.31546    
Education        -0.87094    0.18303  -4.758 2.43e-05 ***
Catholic          0.10412    0.03526   2.953  0.00519 ** 
Infant.Mortality  1.07705    0.38172   2.822  0.00734 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7.165 on 41 degrees of freedom
Multiple R-squared: 0.7067, Adjusted R-squared: 0.671
F-statistic: 19.76 on 5 and 41 DF, p-value: 5.594e-10

You are quite good my friend!
Recall that the Estimates are the coefficients of the independent variables of the linear model (all of which are percentages) and they reflect an estimated change in the dependent variable (fertility) when the corresponding independent variable changes. So, for every 1% increase in percent of males involved in agriculture as an occupation we expect a .17 decrease in fertility, holding all the other variables constant; for every 1% increase in Catholicism, we expect a .10 increase in fertility, holding all other variables constant.
The “*" at the far end of the row indicates that the influence of Agriculture on Fertility is significant. At what alpha level is the t-test of Agriculture significant?

1: 0.1 2: 0.05 3: 0.01 4: R doesn’t say

Selection:
Enter an item from the menu, or 0 to exit
Selection: 2

Excellent work!
Now generate the summary of another linear model (don’t store it in a new variable) in which Fertility depends only on agriculture.

all <- lm(Fertility ~ Agriculture, swiss)

Almost! Try again. Or, type info() for more options.
Type “summary(lm(Fertility ~ Agriculture, swiss))” at the R prompt.

summary(lm(Fertility ~ Agriculture, swiss))

Call: lm(formula = Fertility ~ Agriculture, data = swiss)

Residuals:
     Min       1Q   Median       3Q      Max 
-25.5374  -7.8685  -0.6362   9.0464  24.4858 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 60.30438    4.25126  14.185   <2e-16 ***
Agriculture  0.19420    0.07671   2.532   0.0149 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 11.82 on 45 degrees of freedom
Multiple R-squared: 0.1247, Adjusted R-squared: 0.1052
F-statistic: 6.409 on 1 and 45 DF, p-value: 0.01492

All that hard work is paying off!
What is the coefficient of agriculture in this new model?

1: 0.19420 2: 60.30438 3: 0.07671 4: *

Selection: 1

Keep up the great work!
The interesting point is that the sign of the Agriculture coefficient changed from negative (when all the variables were included in the model) to positive (when the model only considered Agriculture). Obviously the presence of the other factors affects the influence Agriculture has on Fertility.
Let’s consider the relationship between some of the factors. How would you expect level Education and performance on an Examination to be related?

1: They would be uncorrelated 2: They would be correlated 3: I would not be able to guess without more information

Selection: 2

You nailed it! Good job!
Now check your intuition with the R command “cor”. This computes the correlation between Examination and Education.

cor(swiss$Examination, swiss$Education) [1] 0.6984153

Perseverance, that’s the answer.
The correlation of .6984 shows the two are correlated. Now find the correlation between Agriculture and Education.

cor(swiss$Agriculture, swiss$Education) [1] -0.6395225

You are amazing!
The negative correlation (-.6395) between Agriculture and Education might be affecting Agriculture’s influence on Fertility. I’ve loaded and sourced the file swissLMs.R in your working directory. In it is a function makelms() which generates a sequence of five linear models. Each model has one more independent variable than the preceding model, so the first has just one independent variable, Agriculture, and the last has all 5. I’ve tried loading the source code in your editor. If I haven’t done this, open the file manually so you can look at the code.
makelms <- function(){
  # Store the coefficient of linear models with different independent variables
  cf <- c(coef(lm(Fertility ~ Agriculture, swiss))[2], 
          coef(lm(Fertility ~ Agriculture + Catholic,swiss))[2],
          coef(lm(Fertility ~ Agriculture + Catholic + Education,swiss))[2],
          coef(lm(Fertility ~ Agriculture + Catholic + Education + Examination,swiss))[2],
          coef(lm(Fertility ~ Agriculture + Catholic + Education + Examination +Infant.Mortality, swiss))[2])
  print(cf)
}


# Regressor generation process 1.
rgp1 <- function(){
  print("Processing. Please wait.")
  # number of samples per simulation
  n <- 100
  # number of simulations
  nosim <- 1000
  # set seed for reproducability
  set.seed(4321)
  # Point A:
  x1 <- rnorm(n)
  x2 <- rnorm(n)
  x3 <- rnorm(n)
  # Point B:
  betas <- sapply(1 : nosim, function(i)makelms(x1, x2, x3))
  round(apply(betas, 1, var), 5)
}

# Regressor generation process 2.
rgp2 <- function(){
  print("Processing. Please wait.")
  # number of samples per simulation
  n <- 100
  # number of simulations
  nosim <- 1000
  # set seed for reproducability
  set.seed(4321)
  # Point C:
  x1 <- rnorm(n)
  x2 <- x1/sqrt(2) + rnorm(n) /sqrt(2)
  x3 <- x1 * 0.95 + rnorm(n) * sqrt(1 - 0.95^2)
  # Point D:
  betas <- sapply(1 : nosim, function(i)makelms(x1, x2, x3))
  round(apply(betas, 1, var), 5)
}

TEST2

qownnotes-media-FATtcx

qownnotes-media-FRaEUb

qownnotes-media-vvKXCO

qownnotes-media-vvjBvz

qownnotes-media-ytKPiJ

qownnotes-media-GXWwfn

qownnotes-media-BwKEvy

qownnotes-media-NZjjlP

Multivariable Regression

When doing a multivariable analysis, it is very important to choose which variables will enter the model, and in that way eliminate confounding variables that could make the regression we are carrying out erroneous.

We will run a regression of fertility against all the other variables, focusing on its relationship with agriculture, and then look at that relationship excluding the rest of the variables to see how it differs. Everything is in percentages.

qownnotes-media-AOeeST

qownnotes-media-uZewce

This can be done graphically in two ways

qownnotes-media-KWvukl

qownnotes-media-NjyqIW

qownnotes-media-dIhRXq

Let's do it with lm

qownnotes-media-AjSkFc

qownnotes-media-FXaRjc

The t value is simply the Estimate divided by the Std. Error, and the last column gives a two-sided p value that tells us whether the coefficient is statistically significant
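We can verify both relations directly from the coefficient table (using the swiss data from the example):

```r
ctab <- summary(lm(Fertility ~ ., data = swiss))$coefficients

# t value is Estimate divided by Std. Error
all.equal(ctab[, "t value"], ctab[, "Estimate"] / ctab[, "Std. Error"])

# The p value is two-sided, from a t distribution with n - p degrees of freedom
df <- nrow(swiss) - nrow(ctab)
all.equal(ctab[, "Pr(>|t|)"], 2 * pt(abs(ctab[, "t value"]), df, lower.tail = FALSE))
```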

Now let's analyze fertility and agriculture exclusively

qownnotes-media-rnBQSz

The results are very different, yet it still tells me the coefficient is statistically significant.

This kind of study is somewhat dynamic: we will have to keep removing and adding variables to test their effect on the model.

qownnotes-media-nplXCb

When lm returns NA for a coefficient, it is usually because that variable is a linear combination of the others and contributes nothing to the model.
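A minimal sketch of this behavior, again assuming the `swiss` dataset (`z` is a hypothetical regressor built for the purpose):

```r
data(swiss)
# z is an exact linear combination of two existing regressors,
# so lm marks it as aliased and reports NA for its coefficient.
z <- swiss$Agriculture + swiss$Education
coef(lm(Fertility ~ . + z, data = swiss))["z"]
```

The model still fits; R simply drops the redundant column and flags it with NA.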

COMPARING GROUPS OF A CATEGORICAL VARIABLE

Now let's see how lm can be used to compare outcomes against a factor variable.

In this case we will use the InsectSprays dataset, which records each spray and the number of insects it kills.

qownnotes-media-OvnTJV

qownnotes-media-UBmJtf

If we specify nothing, the first level of the factor is taken as the reference, in this case A, which acts as the Intercept.

Linear model fit, group A is the reference

```{r, echo=TRUE}
summary(lm(count ~ spray, data = InsectSprays))$coef
```

What we obtain here is, in each coefficient, the difference from the reference mean (the intercept), and Pr(>|t|) tells us whether the difference between the groups is statistically significant (for A versus B, for example, it is not, as can be seen in the boxplot).

Here we are assuming that we are dealing with a normal distribution and that the variance is homogeneous across groups, which is not true (the counts are always greater than 0), so the t-test p-value will not be valid, but it serves for the example. For count data a Poisson GLM, which we will see later, is more appropriate.

qownnotes-media-ffKRON

We always have to compare against something; if we want a different level to be the reference we have to use relevel.

Reordering the levels

```{r}
spray2 <- relevel(InsectSprays$spray, "C")
summary(lm(count ~ spray2, data = InsectSprays))$coef
```
qownnotes-media-DpIsru

And what happens if we omit the intercept? In that case we no longer compare the groups with each other but against 0 as the reference, so the p-values test whether each group's difference from 0 is statistically significant, and the coefficients are simply the means of each group.

What if we omit the intercept?

```{r, echo=TRUE}
summary(lm(count ~ spray - 1, data = InsectSprays))$coef
unique(ave(InsectSprays$count, InsectSprays$spray))
```
qownnotes-media-xINqMe

Regression lines by group

Let's investigate what happens when two variables we are using interact with each other. To keep it simple we will use a binary variable based on religion, 1 for Catholic and 0 for Protestant, and fit one regression line for each group.

qownnotes-media-KcMfBL

qownnotes-media-kSzOJt

We will fit one regression line for the majority-Catholic provinces and another for the majority-Protestant provinces.

If the variables do not interact, we get 2 lines with different intercepts and the same slope:

Model 1: religion is not taken into account

qownnotes-media-UuDtzx

qownnotes-media-yxLdSz

qownnotes-media-jSVfeC

Model 2: the religion groups do not interact with each other

qownnotes-media-LojhDs

qownnotes-media-RsIVbn

We specify that it be a factor, because otherwise R sometimes treats 0, 1, 2 as a numeric regressor with increasing values, where each value counts proportionally (2 contributing twice as much as 1) rather than as a separate group.

qownnotes-media-sHVqoU

qownnotes-media-CYrSAu

Model 3: the religion groups do interact

qownnotes-media-lnCMeq

qownnotes-media-VKTKln

We do this with *, so the first estimate is the intercept for the non-Catholics and the third is the change in intercept for the Catholics, while the second and fourth are, respectively, the slope for the non-Catholics and the change in slope for the Catholics.
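A sketch of the three models, assuming the `swiss` dataset with a hypothetical binary indicator for majority-Catholic provinces:

```r
data(swiss)
# Hypothetical binary indicator: 1 = majority Catholic, 0 = otherwise.
swiss$CatholicBin <- as.integer(swiss$Catholic > 50)

# Model 1: religion ignored; a single line.
fit1 <- lm(Fertility ~ Agriculture, data = swiss)

# Model 2: no interaction; two parallel lines
# (different intercepts, same Agriculture slope).
fit2 <- lm(Fertility ~ Agriculture + factor(CatholicBin), data = swiss)

# Model 3: interaction via *; different intercepts AND slopes.
# Coefficients: non-Catholic intercept, non-Catholic slope,
# Catholic intercept change, Catholic slope change.
fit3 <- lm(Fertility ~ Agriculture * factor(CatholicBin), data = swiss)
coef(fit3)
```

The 50% cutoff for `CatholicBin` is an assumption made here for illustration.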

qownnotes-media-XFrChM

qownnotes-media-nokyRe

This is the third and final lesson in which we’ll look at regression models with more than one independent variable or predictor. We’ll begin with WHO hunger data which we’ve taken the liberty to load for you. WHO is the World Health Organization and this data concerns young children from around the world and rates of hunger among them which the organization compiled over a number of years. The original csv file was very large and we’ve subsetted just the rows which identify the gender of the child as either male or female. We’ve read the data into the data frame “hunger” for you, so you can easily access it.


As we did in the last lesson let’s first try to get a better understanding of the dataset. Use the R function dim to find the dimensions of hunger.

dim(hunger)
[1] 948  13

Keep working like that and you’ll get there!


How many samples does hunger have?

948
[1] 948

That’s the answer I was looking for.


Now use the R function names to find out what the 13 columns of hunger represent.

names(hunger)
[1] "X"             "Indicator"     "Data.Source"   "PUBLISH.STATES" "Year"          "WHO.region"    "Country"       "Sex"
[9] "Display.Value" "Numeric"       "Low"           "High"           "Comments"

That’s a job well done!


The Numeric column for a particular row tells us the percentage of children under age 5 who were underweight when that sample was taken. This is one of the columns we’ll be focussing on in this lesson. It will be the outcome (dependent variable) for the models we generate.


Let’s first look at the rate of hunger and see how it’s changed over time. Use the R function lm to generate the linear model in which the rate of hunger, Numeric, depends on the predictor, Year. Put the result in the variable fit.

fit <- lm(Numeric ~ Year,hunger)

You are amazing!


Now look at the coef portion of the summary of fit.

summary(fit)$coef
              Estimate  Std. Error   t value     Pr(>|t|)
(Intercept) 634.479660 121.1445995  5.237375 2.007699e-07
Year         -0.308397   0.0605292 -5.095012 4.209412e-07

Great job!


What is the coefficient of hunger$Year?

1: -0.30840 2: 634.47966 3: 121.14460 4: 0.06053

Selection: 1

All that practice is paying off!


What does the negative Estimate of hunger$Year show?

1: As time goes on, the rate of hunger increases 2: I haven’t a clue 3: As time goes on, the rate of hunger decreases

Selection: 3

Perseverance, that’s the answer.


What does the intercept of the model represent?

1: the percentage of hungry children at year 0 2: the number of hungry children at year 0 3: the number of children questioned in the survey

Selection: 1

You’re the best!


Now let’s use R’s subsetting capability to look at the rates of hunger for the different genders to see how, or even if, they differ. Once again use the R function lm to generate the linear model in which the rate of hunger (Numeric) for female children depends on Year. Put the result in the variable lmF. You’ll have to use the R construct x[hunger$Sex==“Female”] to pick out both the correct Numerics and the correct Years.

fit <- lm(Numeric ~ Year, hunger[hunger$Sex=="Female"])
Error in `[.data.frame`(hunger, hunger$Sex == "Female") : undefined columns selected
fit <- lm(Numeric ~ Year, hunger[, hunger$Sex=="Female"])
Error in `[.data.frame`(hunger, , hunger$Sex == "Female") : undefined columns selected
fit <- lm(Numeric ~ Year, hunger)

You seem to have misspelled the model's name. I was expecting lmF but you apparently typed fit.
You almost had it, but not quite. Try again. Or, type info() for more options.
Type lmF <- lm(hunger$Numeric[hunger$Sex=="Female"] ~ hunger$Year[hunger$Sex=="Female"]) at the R prompt or more simply lmF <- lm(Numeric[Sex=="Female"] ~ Year[Sex=="Female"], hunger)

lmF <- lm(hunger$Numeric[hunger$Sex=="Female"] ~ hunger$Year[hunger$Sex=="Female"])

That’s the answer I was looking for.


Do the same for male children and put the result in lmM.

lmM <- lm(hunger$Numeric[hunger$Sex=="Male"] ~ hunger$Year[hunger$Sex=="Male"])

That’s the answer I was looking for.


Now we’ll plot the data points and fitted lines using different colors to distinguish between males (blue) and females (pink).
qownnotes-media-tBfMNp

We can see from the plot that the lines are not exactly parallel. On the right side of the graph (around the year 2010) they are closer together than on the left side (around 1970). Since they aren’t parallel, their slopes must be different, though both are negative. Of the following R expressions which would confirm that the slope for males is negative?

1: lmM$coef[1] 2: lmF$coef[2] 3: lmM$coef[2]

Selection: 3

Your dedication is inspiring!


Now instead of separating the data by subsetting the samples by gender we’ll use gender as another predictor to create the linear model lmBoth. Recall that to do this in R we place a plus sign “+” between the independent variables, so the formula looks like dependent ~ independent1 + independent2.


Create lmBoth now. Numeric is the dependent, Year and Sex are the independent variables. The data is “hunger”. For lmBoth, make sure Year is first and Sex is second.

lmBoth <- lm(Numeric ~ Year+Sex,hunger)

You are doing so well!


Now look at the summary of lmBoth with the R command summary.

summary(lmBoth)

Call:
lm(formula = Numeric ~ Year + Sex, data = hunger)

Residuals:
    Min      1Q  Median      3Q     Max 
-25.472 -11.297  -1.848   7.058  45.990 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 633.5283   120.8950   5.240 1.98e-07 ***
Year         -0.3084     0.0604  -5.106 3.99e-07 ***
SexMale       1.9027     0.8576   2.219   0.0267 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 13.2 on 945 degrees of freedom
Multiple R-squared:  0.03175,   Adjusted R-squared:  0.0297 
F-statistic: 15.49 on 2 and 945 DF,  p-value: 2.392e-07
Keep working like that and you’ll get there!


Notice that three estimates are given, the intercept, one for Year and one for Male. What happened to the estimate for Female? Note that Male and Female are categorical variables hence they are factors in this model. Recall from the last lesson (and slides) that R treats the first (alphabetical) factor as the reference and its estimate is the intercept which represents the percentage of hungry females at year 0. The estimate given for the factor Male is a distance from the intercept (the estimate of the reference group Female). To calculate the percentage of hungry males at year 0 you have to add together the intercept and the male estimate given by the model.


What percentage of young Males were hungry at year 0?

1: 635.431 2: 633.2199 3: 1.9027 4: I can’t tell since the data starts at 1970.

Selection: 3

Try again. Getting it right on the first try is boring anyway!
The intercept is the percentage of females hungry at year 0 and the intercept plus hunger$SexMale is the percentage of males hungry at year 0.

1: I can’t tell since the data starts at 1970. 2: 635.431 3: 633.2199 4: 1.9027

Selection: 2

You got it right!


What does the estimate for hunger$Year represent?

1: the annual decrease in percentage of hungry males 2: the annual decrease in percentage of hungry children of both genders 3: the annual decrease in percentage of hungry females

Selection: 2

That’s a job well done!


Now we’ll replot the data points along with two new lines using different colors. The red line will have the female intercept and the blue line will have the male intercept.
qownnotes-media-BCnmqs


The lines appear parallel. This is because

1: they have the same slope 2: I have no idea 3: they have slopes that are very close

Selection: 1

Excellent job!


Now we’ll consider the interaction between year and gender to see how that affects changes in rates of hunger. To do this we’ll add a third term to the predictor portion of our model formula, the product of year and gender.


Create the model lmInter. Numeric is the outcome and the three predictors are Year, Sex, and Sex*Year. The data is “hunger”.

lmInter <- lm(Numeric ~ Year,Sex,Sex*Year,hunger)
Error in is.data.frame(data) : object 'Sex' not found
lmBoth <- lm(Numeric ~ Year+Sex,hunger)

You seem to have misspelled the model's name. I was expecting lmInter but you apparently typed lmBoth.
You're close…I can feel it! Try it again. Or, type info() for more options.
Type lmInter <- lm(hunger$Numeric ~ hunger$Year + hunger$Sex + hunger$Year * hunger$Sex) or lmInter <- lm(Numeric ~ Year + Sex + Year*Sex, hunger)

lmInter <- lm(Numeric ~ Year + Sex +Year*Sex, hunger)

Excellent work!


Now look at the summary of lmInter with the R command summary.

summary(lmInter)

Call:
lm(formula = Numeric ~ Year + Sex + Year * Sex, data = hunger)

Residuals:
    Min      1Q  Median      3Q     Max 
-25.913 -11.248  -1.853   7.087  46.146 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  603.50580  171.05519   3.528 0.000439 ***
Year          -0.29340    0.08547  -3.433 0.000623 ***
SexMale       61.94772  241.90858   0.256 0.797946    
Year:SexMale  -0.03000    0.12087  -0.248 0.804022    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 13.21 on 944 degrees of freedom
Multiple R-squared:  0.03181,   Adjusted R-squared:  0.02874 
F-statistic: 10.34 on 3 and 944 DF,  p-value: 1.064e-06
Excellent work!


What is the percentage of hungry females at year 0?

1: The model doesn’t say. 2: 61.94772 3: 603.5058

Selection: 3

Your dedication is inspiring!


What is the percentage of hungry males at year 0?

1: 61.94772 2: 665.4535 3: 603.5058 4: The model doesn’t say.

Selection: 2

Excellent work!


What is the annual change in percentage of hungry females?

1: -0.03000 2: 0.08547 3: The model doesn’t say. 4: -0.29340

Selection: 4

You are amazing!


What is the annual change in percentage of hungry males?

1: 0.12087 2: -0.03000 3: -0.32340 4: The model doesn’t say.

Selection: 2

Not exactly. Give it another go.
The estimate associated with Year:SexMale represents the distance of the annual change in percent of males from that of females.

1: The model doesn’t say. 2: 0.12087 3: -0.03000 4: -0.32340

Selection: 4

You are quite good my friend!


Now we’ll replot the data points along with two new lines using different colors to distinguish between the genders.
qownnotes-media-yXmRTO


Which line has the steeper slope?

1: Female 2: Male 3: They look about the same

Selection: 1

Not quite! Try again.
The lines are not parallel and will eventually intersect. The line that is further from horizontal (which has slope 0) has a steeper slope and indicates a faster rate of change. Which line has a slope further from 0?

1: They look about the same 2: Male 3: Female

Selection: 2

All that hard work is paying off!


Finally, we note that things are a little trickier when we’re dealing with an interaction between predictors which are continuous (and not factors). The slides show the underlying algebra, but we can summarize.


Suppose we have two interacting predictors and one of them is held constant. The expected change in the outcome for a unit change in the other predictor is the coefficient of that changing predictor + the coefficient of the interaction * the value of the predictor held constant.


Suppose the linear model is Hi = b0 + (b1Ii) + (b2Yi)+ (b3IiYi) + ei. Here the H’s represent the outcomes, the I’s and Y’s the predictors, neither of which is a category, and the b’s represent the estimated coefficients of the predictors. We can ignore the e’s which represent the residuals of the model. This equation models a continuous interaction since neither I nor Y is a category or factor. Suppose we fix I at some value and let Y vary.
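This can be checked numerically; here is a small simulated sketch (all values are made up for illustration, with b2 = 3 and b3 = 0.5):

```r
set.seed(1)
I <- rnorm(100)
Y <- rnorm(100)
# Simulated outcome following H = b0 + b1*I + b2*Y + b3*I*Y + e
H <- 1 + 2 * I + 3 * Y + 0.5 * I * Y + rnorm(100, sd = .1)
b <- coef(lm(H ~ I * Y))

# With I fixed at 5, the change in H per unit change in Y is
# b2 + b3 * 5: the Y coefficient plus the interaction
# coefficient times the fixed value of I (about 3 + 0.5*5 = 5.5).
unname(b["Y"] + b["I:Y"] * 5)
```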


Which expression represents the change in H per unit change in Y given that I is fixed at 5?

1: b0+b2 2: b2+b3Y 3: b1+5b3 4: b2+b3*5

Selection: 4

You are quite good my friend!


Congratulations! You’ve finished this final lesson in multivariable regression models.

Adjustment and the relationship between variables

Adjustment is the idea of putting extra regressors into the linear model to investigate the role of a third variable in the relationship between the other two, since it is common for such a third variable to distort or confound that relationship.

As an example, consider lung cancer rates and the use of breath mints. We are looking at forced expiratory volume and breath-mint consumption. If we found statistical significance in the regression, it would not be wise to run to the newspapers and announce that eating breath mints lowers lung capacity. First, even if the relationship is solid, we do not know whether it is causal. More importantly in this case, the real explanation is smoking habits: smoking is strongly related to both lung capacity and breath-mint consumption. How do you defend yourself against the accusation that what you are seeing is really variability in smoking habits?

If the finding appeared among smokers and non-smokers separately, then we might have something. People will not believe it unless we hold smoking constant. This is the idea of adding a regressor as an adjustment: the coefficient of interest is interpreted as the effect of the predictor on the response, holding the adjustment variable constant.
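A simulated sketch of how adjustment changes a coefficient (all variables here are made up; `z` plays the role of the confounder, e.g. smoking):

```r
set.seed(42)
n <- 100
z <- rnorm(n)            # confounder (e.g. smoking)
x <- z + rnorm(n)        # predictor driven by the confounder
y <- 2 * z + rnorm(n)    # outcome driven ONLY by the confounder

# Unadjusted: x appears strongly associated with y.
coef(lm(y ~ x))["x"]
# Adjusted for z: the spurious association largely disappears.
coef(lm(y ~ x + z))["x"]
```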

Finding the right adjustments is a complicated task, since it comes down to trying models and checking the results, often having to turn to a subject-matter expert for advice.

Next we will see, through examples, how an adjustment may or may not matter when comparing two groups, say one that receives a treatment and one that does not.

EXAMPLE 1

The horizontal lines are the group means, while the other lines are the linear regressions fitted with and without taking the groups into account.

qownnotes-media-DJgAwG

EXAMPLE 2

qownnotes-media-CPMPro

EXAMPLE 3

The marginal association is the difference between the group means.

qownnotes-media-mkldqL

EXAMPLE 4

qownnotes-media-Tjioqp

EXAMPLE 5

qownnotes-media-uRllAi

EXAMPLE 6

We add a third variable, represented by color.

qownnotes-media-GwKpZj

If we plotted these points in 3 dimensions, we would see that along Z they form a perfectly aligned plane.

Residuals and outlier diagnostics

As we have seen before, the residuals are the differences between the values in the dataset and those computed by our regression model. Studying these residuals will help us see whether our model is good or not.

To explain the influence of particular points on the model, we will go through different situations in which we would have to decide whether a point is an outlier that should be removed from the model. We start from a set of points with a clear linear relationship, plus a series of red points whose inclusion in the dataset we will examine one by one. We will analyze each point's leverage and its capacity to change the model given that leverage (its influence). Think of it as a seesaw: the same weight has a different effect placed near the pivot than at the far end of the seesaw.

qownnotes-media-jCNAIa

Bottom left: it sits right in the middle of all the points, so it has no leverage, being at the very center, and it will have no influence, being perfectly aligned.

Top right: it has a lot of leverage, being so far from the center, but since it is perfectly aligned it will have hardly any influence.

Bottom right: it would have a lot of leverage and a lot of influence, since it is far from the center and not aligned at all.

Top left: in this case it is aligned with the center of the x values but far off the regression line, so it will not have as much influence.
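A sketch of these diagnostics on simulated data (the added point plays the role of the bottom-right case, far from the center and off the line):

```r
set.seed(7)
x <- rnorm(20)
y <- x + rnorm(20, sd = .3)
# Add a point far from the center of x and off the line:
# high leverage AND high influence.
x <- c(x, 10)
y <- c(y, 0)
fit <- lm(y ~ x)

# Leverage: the added point's hat value dwarfs the others.
tail(hatvalues(fit), 3)
# Influence: dfbetas measures how much each point moves the
# coefficients when deleted; the added point dominates.
tail(round(dfbetas(fit), 3), 3)
```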

qownnotes-media-ZaMwrE

qownnotes-media-CSVMBA

R provides several tools for this analysis:

The drawback of raw residuals is that they carry units, the same as the Y values, so the first two tools standardize them.
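R's built-in standardizations, sketched on a hypothetical fit:

```r
set.seed(3)
x <- rnorm(30)
y <- 2 * x + rnorm(30)
fit <- lm(y ~ x)

# rstandard: residuals divided by an internal estimate of
# their standard deviation, making them unit-free.
head(rstandard(fit))
# rstudent: the same idea, but sigma is estimated leaving the
# point's own sample out (externally Studentized).
head(rstudent(fit))
```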

qownnotes-media-PuPqrM

qownnotes-media-skzXBX

Let's study some cases:

Case 1

qownnotes-media-TyISYc

qownnotes-media-anTkny

qownnotes-media-VaVlKp

We see that in this case the dfbeta of the point (10, 10) is much larger than the rest, since it is badly misaligned with the others, and likewise its hat value tells us it is very far from the center, so here we would decide it is an outlier. It would have a lot of influence.

Case 2

qownnotes-media-PYgHwp

qownnotes-media-ypsYyK

Here we see that its dfbeta is not large relative to the rest, since the point is very well aligned, but its hat value is very large, since it is far from the center. It will not have much influence.

R has a facility that helps us visualize several things about the residuals and decide whether the model fits well or not:

qownnotes-media-oRbgEj

First panel: here we look for patterns in the residuals against the fitted values; we could spot heteroscedasticity, for example.

The simplest diagnostic plot displays residuals versus fitted values. Residuals should be uncorrelated with the fit, independent and (almost) identically distributed with mean zero. In this case there is a linear pattern involving all but one residual and the fit.

plot(lm(y ~ x, out2), which=1)

qownnotes-media-UjuEUC

Our influential outlier is in row 1 of the data. To exclude it is just a matter of using out2[-1, ] rather than out2 as the data.

plot(lm(y ~ x, out2[-1,]), which=1)

qownnotes-media-QuRHSf

This plot has none of the patterned appearance of the first. It looks as we would expect if residuals were independently and (almost) identically distributed with zero mean, and were uncorrelated with the fit.

The change which inclusion or exclusion of a sample induces in coefficients is a simple measure of its influence. Subtract coef(fitno) from coef(fit) to see the change induced by including the influential first sample.

coef(lm(y ~ x, out2)) - coef(lm(y ~ x, out2[-1,]))
(Intercept)           x 
-0.01167866 -0.53363019

qownnotes-media-sSRaTe

qownnotes-media-PJEUic

When a sample is included in a model, it pulls the regression line closer to itself (orange line) than that of the model which excludes it (black line.) Its residual, the difference between its actual y value and that of a regression line, is thus smaller in magnitude when it is included (orange dots) than when it is omitted (black dots.) The ratio of these two residuals, orange to black, is therefore small in magnitude for an influential sample. For a sample which is not influential the ratio would be close to 1. Hence, 1 minus the ratio is a measure of influence, near 0 for points which are not influential, and near 1 for points which are.
qownnotes-media-nhIAkH

This measure is sometimes called influence, sometimes leverage, and sometimes hat value. Since it is 1 minus the ratio of two residuals, to calculate it from scratch we must first obtain the two residuals. The ratio's numerator (orange dots) is the residual of the first sample of the model we called fit. The model fitno, which excludes this sample, also excludes its residual, so we will have to calculate its value. This is easily done. We use R's predict function to calculate fitno's predicted value of y and subtract it from the actual value. Use the expression resno <- out2[1, "y"] - predict(fitno, out2[1,]) to do the calculation.

resno <- out2[1, "y"] - predict(fitno, out2[1,])

Now calculate the influence of our outlier using 1 - resid(fit)[1]/resno or an equivalent expression.

1 - resid(fit)[1]/resno
        1 
0.6311547

hatvalues: The function, hatvalues, performs for every sample a calculation equivalent to the one you’ve just done. Thus the first entry of hatvalues(fit) should match the value which you have just calculated. Since there are quite a few samples, use head(hatvalues(fit)) or View(hatvalues(fit)) to compare the influence measure of our outlier to that of some other samples.
qownnotes-media-QbzaEE

qownnotes-media-lkikYx

sigma <- sqrt(deviance(fit)/df.residual(fit))

qownnotes-media-wNuUSr

rstd <- resid(fit)/(sigma * sqrt(1-hatvalues(fit)))

qownnotes-media-UrFVcY

Second panel: used to assess whether the errors follow a normal distribution.

qownnotes-media-bpEAwS

qownnotes-media-DIQlWY

Look at the outlier's standardized residual, labeled 1 on the Normal QQ plot; it is almost 5 standard deviations from the mean.

Third panel: the same as the first but with the data scaled, which lets us compare it directly with other models without worrying about units.

qownnotes-media-BVZMqo

qownnotes-media-bxkQXo

qownnotes-media-uAIqZk

Fourth panel: this shows each point's leverage against its influence.

Studentized residuals (sometimes called externally Studentized residuals) estimate the standard deviations of individual residuals using, in addition to individual hat values, the deviance of a model which leaves the associated sample out. We'll illustrate using the outlier. Recalling that the model we called fitno omits the outlier sample, calculate the sample standard deviation of fitno's residual by dividing its deviance by its residual degrees of freedom and taking the square root. Store the result in a variable called sigma1.

sigma1 <- sqrt(deviance(fitno)/df.residual(fitno))

Calculate the Studentized residual for the outlier sample by dividing resid(fit)[1] by the product of sigma1 and sqrt(1-hatvalues(fit)[1]). There is no need to store this in a variable.

qownnotes-media-OjOrwb

qownnotes-media-cDYWdG

Recall that we calculated the sample standard deviation of fit’s residual, sigma, earlier. Divide the summed squares of dy by 2*sigma^2 to calculate the outlier’s Cook’s distance. There is no need to store the result in a variable.
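R also computes this directly; a sketch on a hypothetical fit with one influential point:

```r
set.seed(9)
x <- c(5, rnorm(20))
y <- c(5, rnorm(20))   # the first sample sits far from the others
fit <- lm(y ~ x)

# cooks.distance applies this calculation to every sample;
# the influential first point stands out clearly.
round(head(cooks.distance(fit)), 4)
```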
qownnotes-media-oQgkYI

qownnotes-media-sNOUMr

Model selection (VIF and ANOVA)

There is no rule for choosing the right prediction model and the right variables; each case has to be examined and decisions made as we go.

Removing important variables will give us biased results, while adding unnecessary variables will not bias the predictions, but it does inflate the standard errors.

qownnotes-media-NkyPQS

In this first case we see that adding variables to the model does not impact the results much; we will rely on the first method, which is VIF.

VIF

VARIANCE INFLATION FACTORS

The VIF is the increase in the variance of the estimate for regressor i compared with the ideal situation in which it is orthogonal to the other regressors. The square root of the VIF is the increase in the standard deviation. Remember that variance inflation is only part of the picture: we may want to include certain variables even if they dramatically inflate our variance.
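VIF can be computed without extra packages: for regressor i it is 1/(1 − R²) of the regression of that regressor on the others. A sketch on simulated, correlated regressors:

```r
set.seed(11)
n <- 100
x1 <- rnorm(n)
x2 <- x1 / sqrt(2) + rnorm(n) / sqrt(2)   # correlated with x1
x3 <- rnorm(n)                            # independent of x1

# VIF of x1: regress x1 on the other regressors and take
# 1 / (1 - R^2); orthogonal regressors would give VIF = 1.
r2 <- summary(lm(x1 ~ x2 + x3))$r.squared
1 / (1 - r2)
```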

qownnotes-media-iQcjNt

In this second example, where we impose strong dependence among the variables, we can clearly see how the variance changes:

qownnotes-media-KsHtiM

qownnotes-media-kkKeGA

qownnotes-media-mmNOIM

qownnotes-media-ZOcqNv

As for the increase in the standard error:

qownnotes-media-fXblbs

Exercises

In modeling, our interest lies in parsimonious, interpretable representations of the data that enhance our understanding of the phenomena under study. Omitting variables results in bias in the coefficients of interest - unless their regressors are uncorrelated with the omitted ones. On the other hand, including any new variables increases (actual, not estimated) standard errors of other regressors. So we don't want to idly throw variables into the model. This lesson is about the second of these two issues, which is known as variance inflation.

We shall use simulations to illustrate variance inflation. The source code for these simulations is in a file named vifSims.R which I have copied into your working directory and tried to display in your source code editor. If I’ve failed to display it, you should open it manually.
makelms <- function(x1, x2, x3){
  # Simulate a dependent variable, y, as x1
  # plus a normally distributed error of mean 0 and 
  # standard deviation .3.
  y <- x1 + rnorm(length(x1), sd = .3)
  # Find the coefficient of x1 in 3 nested linear
  # models, the first including only the predictor x1,
  # the second x1 and x2, the third x1, x2, and x3.
  c(coef(lm(y ~ x1))[2], 
    coef(lm(y ~ x1 + x2))[2], 
    coef(lm(y ~ x1 + x2 + x3))[2])
}
Find the function, makelms, at the top of vifSims.R. The final expression in makelms creates 3 linear models. The first, lm(y ~ x1), predicts y in terms of x1, the second predicts y in terms of x1 and x2, the third in terms of all three regressors. The second coefficient of each model, for instance coef(lm(y ~ x1))[2], is extracted and returned in a 3-long vector. What does this second coefficient represent?

1: The coefficient of x2. 2: The coefficient of the intercept. 3: The coefficient of x1.

Selection: 3

Great job!
In makelms, the simulated dependent variable, y, depends on which of the regressors?

1: x1 and x2 2: x1 3: x1, x2, and x3

Selection: 2

That’s the answer I was looking for.

# Regressor generation process 1.
rgp1 <- function(){
  print("Processing. Please wait.")
  # number of samples per simulation
  n <- 100
  # number of simulations
  nosim <- 1000
  # set seed for reproducibility
  set.seed(4321)
  # Point A
  x1 <- rnorm(n)
  x2 <- rnorm(n)
  x3 <- rnorm(n)
  # Point B
  betas <- sapply(1 : nosim, function(i)makelms(x1, x2, x3))
  round(apply(betas, 1, var), 5)
}

# Regressor generation process 2.
rgp2 <- function(){
  print("Processing. Please wait.")
  # number of samples per simulation
  n <- 100
  # number of simulations
  nosim <- 1000
  # set seed for reproducibility
  set.seed(4321)
  # Point C
  x1 <- rnorm(n)
  x2 <- x1/sqrt(2) + rnorm(n) /sqrt(2)
  x3 <- x1 * 0.95 + rnorm(n) * sqrt(1 - 0.95^2)
  # Point D
  betas <- sapply(1 : nosim, function(i)makelms(x1, x2, x3))
  round(apply(betas, 1, var), 5)
}
In vifSims.R, find the functions, rgp1() and rgp2(). Both functions generate 3 regressors, x1, x2, and x3. Compare the lines following the comment Point A in rgp1() with those following Point C in rgp2(). Which of the following statements about x1, x2, and x3 is true?

1: x1, x2, and x3 are correlated in both rgp1() and rgp2(). 2: x1, x2, and x3 are uncorrelated in both rgp1() and rgp2(). 3: x1, x2, and x3 are uncorrelated in rgp1(), but not in rgp2(). 4: x1, x2, and x3 are correlated in rgp1(), but not in rgp2().

Selection: 2

You are doing so well!

In the line following Point B in rgp1(), the function makelms(x1, x2, x3) is applied 1000 times. Each time it is applied, it simulates a new dependent variable, y, and returns estimates of the coefficient of x1 for each of the 3 models, y ~ x1, y ~ x1 + x2, and y ~ x1 + x2 + x3. It thus computes 1000 estimates of the 3 coefficients, collecting the results in a 3x1000 array, betas. In the next line, the expression, apply(betas, 1, var), does which of the following?

1: Computes the variance of each row. 2: Computes the variance of each column.

Selection: 1

All that practice is paying off!
The function rgp1() computes the variance in estimates of the coefficient of x1 in each of the three models, y ~ x1, y ~ x1 + x2, and y ~ x1 + x2 + x3. (The results are rounded to 5 decimal places for convenient viewing.) This simulation approximates the variance (i.e., squared standard error) of x1’s coefficient in each of these three models. Recall that variance inflation is due to correlated regressors and that in rgp1() the regressors are uncorrelated. Run the simulation rgp1() now. Be patient. It takes a while.

rgp1()
[1] "Processing. Please wait."
     x1      x1      x1 
0.00110 0.00111 0.00112

Nice work!
The variances in each of the three models are approximately equal, as expected, since the other regressors, x2 and x3, are uncorrelated with the regressor of interest, x1. However, in rgp2(), x2 and x3 both depend on x1, so we should expect an effect. From the expressions assigning x2 and x3 which follow Point C, which is more strongly correlated with x1?
In vifSims.R, look at the lines following Point C again, and note that 1/sqrt(2) in the expression for x2 is much smaller than 0.95 in the expression for x3.

1: x2 2: x3

Selection: 2

Great job!

Run rgp2() to simulate standard errors in the coefficient of x1 for cases in which x1 is correlated with the other regressors.

rgp2()
[1] "Processing. Please wait."
     x1      x1      x1 
0.00110 0.00240 0.00981

Nice work!

In this case, variance inflation due to correlated regressors is clear, and is most pronounced in the third model, y ~ x1 + x2 + x3, since x3 is the regressor most strongly correlated with x1.

In these two simulations we had 1000 samples of estimated coefficients, hence could calculate sample variance in order to illustrate the effect. In a real case, we have only one set of coefficients and we depend on theoretical estimates. However, theoretical estimates contain an unknown constant of proportionality. We therefore depend on ratios of theoretical estimates called Variance Inflation Factors, or VIFs.
A variance inflation factor (VIF) is a ratio of estimated variances, the variance due to including the ith regressor, divided by that due to including a corresponding ideal regressor which is uncorrelated with the others. VIF’s can be calculated directly, but the car package provides a convenient method for the purpose as we will illustrate using the Swiss data from the datasets package.
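Before the walkthrough, here is a small sketch of that ratio computed by hand and checked against car::vif() (assuming the car package is installed):

```r
# Sketch: compute one VIF directly and compare with car::vif(), using the
# swiss data from the datasets package. Assumes car is installed.
library(car)
data(swiss)

mdl <- lm(Fertility ~ Agriculture + Examination + Education +
            Catholic + Infant.Mortality, data = swiss)

# Direct computation for Education: regress it on the other regressors
r2 <- summary(lm(Education ~ Agriculture + Examination +
                   Catholic + Infant.Mortality, data = swiss))$r.squared
1 / (1 - r2)          # about 2.775
vif(mdl)["Education"] # the same value, via car
```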

According to its documentation, the Swiss data set consists of a standardized fertility measure and socioeconomic indicators for each of 47 French-speaking provinces of Switzerland in about 1888 when Swiss fertility rates began to fall. Type head(swiss) or View(swiss) to examine the data.
qownnotes-media-IhrIMw

Nice work!

Fertility was thought to depend on five socioeconomic factors: the percent of males working in Agriculture, the percent of draftees receiving the highest grade on the army’s Examination, the percent of draftees with Education beyond primary school, the percent of the population which was Roman Catholic, and the rate of Infant Mortality in the province. Use linear regression to model Fertility in terms of these five regressors and an intercept. Store the model in a variable named mdl.

mdl <- lm(fertility ~ Agriculture+Examination+Education+Catholic+Infant.Mortality,swiss)
Error in eval(expr, envir, enclos) : object 'fertility' not found
mdl <- lm(Fertility ~ Agriculture+Examination+Education+Catholic+Infant.Mortality,swiss)

Perseverance, that’s the answer.

Calculate the VIF’s for each of the regressors using vif(mdl).

vif(mdl)
     Agriculture      Examination        Education         Catholic Infant.Mortality 
        2.284129         3.675420         2.774943         1.937160         1.107542

That’s correct!

These VIF’s show, for each regression coefficient, the variance inflation due to including all the others. For instance, the variance in the estimated coefficient of Education is 2.774943 times what it might have been if Education were not correlated with the other regressors. Since Education and score on an Examination are likely to be correlated, we might guess that most of the variance inflation for Education is due to including Examination.
Make a second linear model of Fertility in which Examination is omitted, but the other four regressors are included. Store the result in a variable named mdl2.

mdl2 <- lm(Fertility ~ Agriculture+Education+Catholic+Infant.Mortality,swiss)

Your dedication is inspiring!

Calculate the VIF’s for this model using vif(mdl2).

vif(mdl2)
     Agriculture        Education         Catholic Infant.Mortality 
        2.147153         1.816361         1.299916         1.107528

You are quite good my friend!

As expected, omitting Examination has markedly decreased the VIF for Education, from 2.774943 to 1.816361. Note that omitting Examination has had almost no effect on the VIF for Infant Mortality. Chances are Examination and Infant Mortality are not strongly correlated. Now, before finishing this lesson, let's review several significant points.

A VIF describes the increase in the variance of a coefficient due to the correlation of its regressor with the other regressors. What is the relationship of a VIF to the standard error of its coefficient?

1: They are the same. 2: There is no relationship. 3: VIF is the square of standard error inflation.

Selection: 3

You nailed it! Good job!

If a regressor is strongly correlated with others, hence will increase their VIF’s, why shouldn’t we just exclude it?

1: We should never exclude anything. 2: We should always exclude it. 3: Excluding it might bias coefficient estimates of regressors with which it is correlated.

Selection: 3

Keep up the great work!

The problems of variance inflation and bias due to excluded regressors both involve correlated regressors. However there are methods, such as factor analysis or principal component analysis, which can convert regressors to an equivalent uncorrelated set. Why then, when modeling, should we not just use uncorrelated regressors and avoid all the trouble?

1: We should always use uncorrelated regressors. 2: Using converted regressors may make interpretation difficult. 3: Factor analysis takes too much computation.

Selection: 2

Excellent job!

ANOVA

Finally, there is a tool called ANOVA (analysis of variance), or nested model testing, that lets us track progressively how the model evolves as variables are included. The idea is to add the regressors one at a time and study the impact of each addition.
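As a quick sketch of the idea (using the swiss data, which the exercises below also use):

```r
# Sketch of nested-model testing with anova(): add regressors to the swiss
# Fertility model one step at a time and test each addition for significance.
data(swiss)
fit1 <- lm(Fertility ~ Agriculture, data = swiss)
fit3 <- update(fit1, . ~ . + Examination + Education)
fit5 <- update(fit3, . ~ . + Catholic + Infant.Mortality)
anova(fit1, fit3, fit5)  # sequential F tests for each nested pair
```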

qownnotes-media-igtMZz

Overfitting and Underfitting (ANOVA)

The VIF section showed how including new variables increases the standard error of the estimated coefficients of other correlated regressors. We therefore should not carelessly throw more and more variables into the model. On the other hand, omitting variables can bias the coefficients of the regressors that are correlated with the omitted ones. Below are examples of the impact of omitted variables, using ANOVA to build parsimonious, interpretable representations of the data.

First, I would like to illustrate how omitting a correlated regressor can bias estimates of a coefficient. The relevant source code is in a file named fitting.R which I have copied into your working directory and tried to display in your source code editor. If I’ve failed to display it, you should open it manually.
simbias <- function(seed=8765){
  # The default seed guarantees a nice histogram. This is the only
  # reason that accepting the default, x1c <- simbias(), is required in the lesson. 
  # The effect will be evident with other seeds as well.
  set.seed(seed) 
  temp <- rnorm(100)
  # Point A
  x1 <- (temp + rnorm(100))/sqrt(2)
  x2 <- (temp + rnorm(100))/sqrt(2)
  x3 <- rnorm(100)
  # Function to simulate regression of y on 2 variables.
  f <- function(k){
    # Point B
    y <- x1 + x2 + x3 + .3*rnorm(100)
    # Point C
    c(lm(y ~ x1 + x2)$coef[2],
       lm(y ~ x1 + x3)$coef[2])
  }
  # Point D
  sapply(1:150, f)
}

Find the function simbias() at the top of fitting.R. Below the comment labeled Point A three regressors, x1, x2, and x3, are defined. Which of these two are correlated?

1: x2 and x3 2: x1 and x3 3: x1 and x2

Selection: 3

Great job!

Within simbias() another function, f(n), is defined. It forms a dependent variable, y, and at Point C returns the coefficient of x1 as estimated by two models, y ~ x1 + x2, and y ~ x1 + x3. One regressor is missing in each model. In the expression for y (Point B,) what is the actual coefficient of x1?

1: 1 2: 0.3 3: 1/sqrt(2)

Selection: 1

Excellent job!

At Point D in simbias() the internal function, f(), is applied 150 times and the results returned as a 2x150 matrix. The first row of this matrix contains independent estimates of x1’s coefficient in the case that x3, the regressor uncorrelated with x1, is omitted. The second row contains estimates of x1’s coefficient when the correlated regressor, x2, is omitted. Use simbias(), accepting the default argument, to form these estimates and store the result in a variable called x1c. (The default argument just guarantees a nice histogram, in a figure to follow.)

x1c <- simbias()

You got it right!

The actual coefficient of x1 is 1. Having been warned that omitting a correlated regressor would bias estimates of x1’s coefficient, we would expect the mean estimate of x1c’s second row to be farther from 1 than the mean of x1c’s first row. Using apply(x1c, 1, mean), find the means of each row.

apply(x1c, 1, mean)
      x1       x1 
1.034403 1.476944

You are quite good my friend!

Histograms of estimates from x1c’s first row (blue) and second row (red) are shown. Estimates from the second row are clearly more than two standard deviations from the correct value of 1, and the bias due to omitting the correlated regressor is evident. (The code which produced this figure is incidental to the lesson, but is available as the function x1hist(), at the bottom of fitting.R.)
qownnotes-media-IdkMyQ

# Plot histograms illustrating bias in estimates of a regressor
# coefficient 1) when an uncorrelated regressor is missing and
# 2) when a correlated regressor is missing.
x1hist <- function(x1c){
  p1 <- hist(x1c[1,], plot=FALSE)
  p2 <- hist(x1c[2,], plot=FALSE)
  yrange <- c(0, max(p1$counts, p2$counts))
  plot(p1, col=rgb(0,0,1,1/4), xlim=range(x1c), ylim=yrange, xlab="Estimated coefficient of x1",
        main="Bias Effect of Omitted Regressor")
  plot(p2, col=rgb(1,0,0,1/4), xlim=range(x1c), ylim=yrange, add=TRUE)
  legend(1.1, 40, c("Uncorrelated regressor, x3, omitted", "Correlated regressor, x2, omitted"),
         fill=c(rgb(0,0,1,1/4), rgb(1,0,0,1/4)))
}
Adding even irrelevant regressors can cause a model to tend toward a perfect fit. We illustrate this by adding random regressors to the swiss data and regressing on progressively more of them. As the number of regressors approaches the number of data points (47), the residual sum of squares, also known as the deviance, approaches 0. (The source code for this figure can be found as function bogus() in fitting.R.)
qownnotes-media-OAXbzg

# Illustrate the effect of bogus regressors on residual squared error.
bogus <- function(){
  temp <- swiss
  # Add 41 columns of random regressors to a copy of the swiss data.
  for(n in 1:41){temp[,paste0("random",n)] <- rnorm(nrow(temp))}
  # Define a function to compute the deviance of Fertility regressed
  # on all regressors up to column n. The function, deviance(model), computes
  # the residual sum of squares of the model given as its argument.
  f <- function(n){deviance(lm(Fertility ~ ., temp[,1:n]))}
  # Apply f to data from n=6, i.e., the legitimate regressors,
  # through n=47, i.e., a full complement of bogus regressors.
  rss <- sapply(6:47, f)
  # Display result.
  plot(0:41, rss, xlab="Number of bogus regressors.", ylab="Residual squared error.",
       main="Residual Squared Error for Swiss Data\nUsing Irrelevant (Bogus) Regressors",
       pch=21, bg='red')
}
In the figure, adding random regressors decreased deviance, but we would be mistaken to believe that such decreases are significant. To assess significance, we should take into account that adding regressors reduces residual degrees of freedom. Analysis of variance (ANOVA) is a useful way to quantify the significance of additional regressors. To exemplify its use, we will use the swiss data.

Recall that the Swiss data set consists of a standardized fertility measure and socioeconomic indicators for each of 47 French-speaking provinces of Switzerland in 1888. Fertility was thought to depend on an intercept and five factors denoted as Agriculture, Examination, Education, Catholic, and Infant Mortality. To begin our ANOVA example, regress Fertility on Agriculture and store the result in a variable named fit1.

fit1 <- lm(Fertility ~ Agriculture,swiss)

You are doing so well!

Create another model, named fit3, by regressing Fertility on Agriculture and two additional regressors, Examination and Education.

fit3 <- lm(Fertility ~ Agriculture+Examination+Education,swiss)

You are doing so well!

We’ll now use anova to assess the significance of the two added regressors. The null hypothesis is that the added regressors are not significant. We’ll explain in detail shortly, but right now just apply the significance test by entering anova(fit1, fit3).

anova(fit1,fit3)

qownnotes-media-pkQDWT

The three asterisks, ***, at the lower right of the printed table indicate that the null hypothesis is rejected at the 0.001 level, so at least one of the two additional regressors is significant. Rejection is based on a right-tailed F test, Pr(>F), applied to an F value. According to the table, what is that F value?

1: 3102.2 2: 20.968 3: 45

Selection: 2

You nailed it! Good job!

An F statistic is a ratio of two sums of squares divided by their respective degrees of freedom. If the two scaled sums are independent and centrally chi-squared distributed with the same variance, the statistic will have an F distribution with parameters given by the two degrees of freedom. In our case, the two sums are residual sums of squares which, as we know, have mean zero hence are centrally chi-squared provided the residuals themselves are normally distributed. The two relevant sums are given in the RSS (Residual Sum of Squares) column of the table. What are they?

1: 45 and 43 2: 2 and 3102.2 3: 6283.1 and 3180.9

Selection: 3

Keep up the great work!

R’s function, deviance(model), calculates the residual sum of squares, also known as the deviance, of the linear model given as its argument. Using deviance(fit3), verify that 3180.9 is fit3’s residual sum of squares. (Of course, fit3 is called Model 2 in the table.)

deviance(fit3)
[1] 3180.925

You are doing so well!

In the next several steps, we will show how to calculate the F value, 20.968, which appears in the table printed by anova(). We'll begin with the denominator, which is fit3's residual sum of squares divided by its degrees of freedom. Fit3 has 43 residual degrees of freedom. This figure is obtained by subtracting 4, the number of fit3's predictors (the 3 named and the intercept), from 47, the number of samples in swiss. Store the value of deviance(fit3)/43 in a variable named d.

d <- deviance(fit3)/43

Great job!

The numerator is the difference, deviance(fit1)-deviance(fit3), divided by the difference in the residual degrees of freedom of fit1 and fit3, namely 2. This calculation requires some theoretical justification which we omit, but the essential idea is that fit3 has 2 predictors in addition to those of fit1. Calculate the numerator and store it in a variable named n.

n <- deviance(fit1)-deviance(fit3)

Not quite right, but keep trying. Or, type info() for more options.
Enter n <- (deviance(fit1) - deviance(fit3))/2 at the R prompt.

n <- (deviance(fit1) - deviance(fit3))/2

Keep up the great work!

Calculate the ratio, n/d, to show it is essentially equal to the F value, 20.968, given by anova().

n/d
[1] 20.96783

You are quite good my friend!

We’ll now calculate the p-value, which is the probability that a value of n/d or larger would be drawn from an F distribution which has parameters 2 and 43. This value was given as 4.407e-07 in the column labeled Pr(>F) in the table printed by anova(), a very unlikely value if the null hypothesis were true. Calculate this p-value using pf(n/d, 2, 43, lower.tail=FALSE).

pf(n/d, 2, 43, lower.tail=FALSE)
[1] 4.406913e-07

Excellent job!

Based on the calculated p-value, a false rejection of the null hypothesis is extremely unlikely. We are confident that fit3 is significantly better than fit1, with one caveat: analysis of variance is sensitive to its assumption that model residuals are approximately normal. If they are not, we could get a small p-value for that reason. It is thus worth testing residuals for normality. The Shapiro-Wilk test is quick and easy in R. Normality is its null hypothesis. Use shapiro.test(fit3$residuals) to test the residuals of fit3.

shapiro.test(fit3$residuals)

	Shapiro-Wilk normality test

data:  fit3$residuals
W = 0.97276, p-value = 0.336

That’s correct!

The Shapiro-Wilk p-value of 0.336 fails to reject normality, supporting confidence in our analysis of variance. In order to illustrate the use of anova() with more than two models, I have constructed fit5 and fit6 using the first 5 and all 6 regressors (including the intercept) respectively. Thus fit1, fit3, fit5, and fit6 form a nested sequence of models; the regressors of one are included in those of the next. Enter anova(fit1, fit3, fit5, fit6) at the R prompt now to get the flavor.
qownnotes-media-KXGTlK

You nailed it! Good job!

It appears that each model is a significant improvement on its predecessor. Before ending the lesson, let’s review a few salient points.

Omitting a regressor can bias estimation of the coefficient of certain other regressors. Which ones?

1: Uncorrelated regressors 2: Correlated regressors

Selection: 2

You got it!

Including more regressors will reduce a model’s residual sum of squares, even if the new regressors are irrelevant. True or False?

1: False 2: It depends on circumstances. 3: True

Selection: 1

You’re close…I can feel it! Try it again.
It doesn’t depend on circumstances.

1: True 2: False 3: It depends on circumstances.

Selection: 3

You almost had it, but not quite. Try again.
It doesn’t depend on circumstances.

1: False 2: True 3: It depends on circumstances.

Selection: 2

You nailed it! Good job!

When adding regressors, the reduction in residual sums of squares should be tested for significance above and beyond that of reducing residual degrees of freedom. R’s anova() function uses an F-test for this purpose. What else should be done to insure that anova() applies?

1: Regressors should be tested for normality. 2: Model residuals should be tested for normality. 3: The residuals should be tested for having zero means.

Selection: 3

You’re close…I can feel it! Try it again.
F-tests are sensitive to the assumption of normality.

1: The residuals should be tested for having zero means. 2: Model residuals should be tested for normality. 3: Regressors should be tested for normality.

Selection: 2

You are doing so well!

TEST3

qownnotes-media-FyoRQF

qownnotes-media-PDMYZn

qownnotes-media-yEfYeY

qownnotes-media-syOIex

qownnotes-media-WugCgL

qownnotes-media-IxTzuQ

GLM (Generalized Linear Models)

Generalized Linear Models (GLMs) are the family of models that includes linear models, plus additional models for cases where linear models fall short, for example discrete responses (how do we accumulate the error?) or strictly positive responses. We could apply a transformation (an arcsine transformation for binary outcomes, or the natural logarithm for strictly positive values), but transformations often alter the coefficients of our model and make them hard to interpret, and the data are no longer in the original units. GLMs therefore also include binomial and binary regression as well as Poisson regression.

GLMs were introduced in 1972 as an improvement over transformation techniques. They are made up of 3 components:

- A random component from an exponential-family model (normal, binomial, Poisson) for the response (in linear models this random component is the errors).
- A systematic component via a linear predictor, which is the part we are modeling (in linear models, the covariates and the coefficients of the linear component).
- A link function connecting the mean of the exponential-family response to the linear predictor. It is essential to understand that we are not transforming the response (the Ys) but the mean of its distribution.

In this course we will focus on binomial (logistic) and Poisson regression, and only on the most popular link functions.

Linear regression example

  • Assume \(Y_i \sim N(\mu_i, \sigma^2)\) (the Gaussian distribution belongs to the exponential family).
  • Define the linear predictor as \(\eta_i = \sum_{k=1}^p X_{ik} \beta_k\).
  • The link function \(g\) is defined so that \(g(\mu) = \eta\).
  • For linear models \(g(\mu) = \mu\), so that \(\mu_i = \eta_i\).
  • This turns out to give the same likelihood model as the additive-error Gaussian linear model \[ Y_i = \sum_{k=1}^p X_{ik} \beta_k + \epsilon_{i} \] where \(\epsilon_i \stackrel{iid}{\sim} N(0, \sigma^2)\).

Poisson regression

  • Assume \(Y_i \sim Poisson(\mu_i)\) so that \(E[Y_i] = \mu_i\) where \(0 \leq \mu_i\).
  • Linear predictor \(\eta_i = \sum_{k=1}^p X_{ik} \beta_k\).
  • Link function \(g(\mu) = \eta = \log(\mu)\).
  • Recall that \(e^x\) is the inverse of \(\log(x)\), so that \[ \mu_i = e^{\eta_i} \] The likelihood is therefore \[ \prod_{i=1}^n (y_i !)^{-1} \mu_i^{y_i}e^{-\mu_i} \propto \exp\left(\sum_{i=1}^n y_i \eta_i - \sum_{i=1}^n \mu_i\right) \]
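A minimal simulated sketch of Poisson regression with the canonical log link (the data and coefficients here are made up for illustration):

```r
# Sketch: Poisson regression on simulated counts. With the log link the
# coefficients act multiplicatively on the mean: E[y] = exp(b0 + b1 * x).
set.seed(42)
x <- rnorm(500)
y <- rpois(500, lambda = exp(1 + 0.5 * x))  # true b0 = 1, b1 = 0.5

fit <- glm(y ~ x, family = poisson)
coef(fit)       # estimates close to the true values 1 and 0.5
exp(coef(fit))  # multiplicative effects on the mean
```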

funcionLogistica

  • In each case, the only way the likelihood depends on the data is through \[\sum_{i=1}^n y_i \eta_i = \sum_{i=1}^n y_i\sum_{k=1}^p X_{ik} \beta_k = \sum_{k=1}^p \beta_k\sum_{i=1}^n X_{ik} y_i \] so we do not need all of the data, only \(\sum_{i=1}^n X_{ik} y_i\). This simplification is a consequence of choosing the so-called 'canonical' link functions.
  • (This has to be derived.) All of these models attain their maximum at the root of the so-called normal equations \[ 0=\sum_{i=1}^n \frac{(Y_i - \mu_i)}{Var(Y_i)}W_i \] where the \(W_i\) are the derivatives of the inverse of the link function.

A note on variances: in linear models we assume the variance is constant, whereas in the Bernoulli and Poisson cases the variance varies from observation to observation. When the variances do not adhere closely enough to the variance structure defined by the GLM, R provides the quasipoisson and quasi families, which relax these restrictions somewhat.

  • For the linear model, \(Var(Y_i) = \sigma^2\) is constant.
  • For Bernoulli, \(Var(Y_i) = \mu_i (1 - \mu_i)\).
  • For Poisson, \(Var(Y_i) = \mu_i\).
  • In certain cases it is useful to have a more flexible variance model, even if it does not correspond exactly to an actual likelihood \[ 0=\sum_{i=1}^n \frac{(Y_i - \mu_i)}{\phi \mu_i (1 - \mu_i ) } W_i ~~~\mbox{and}~~~ 0=\sum_{i=1}^n \frac{(Y_i - \mu_i)}{\phi \mu_i} W_i \]
  • These are the so-called 'quasi-likelihood' normal equations.
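A small sketch of the quasipoisson idea on simulated overdispersed counts (the data-generating process here is just an illustration):

```r
# Sketch: quasipoisson relaxes Var(Y) = mu by estimating a dispersion
# parameter phi, so Var(Y) = phi * mu. Point estimates match the Poisson
# fit; only the standard errors are rescaled.
set.seed(7)
x <- rnorm(300)
# Overdispersed counts: a Poisson count thinned/scaled by a binomial factor
y <- rpois(300, exp(1 + 0.3 * x)) * rbinom(300, 3, 0.5)

fit_p  <- glm(y ~ x, family = poisson)
fit_qp <- glm(y ~ x, family = quasipoisson)
coef(fit_p) - coef(fit_qp)  # identical point estimates
summary(fit_qp)$dispersion  # phi > 1 signals overdispersion
```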

Binary outcomes (Bernoulli)

It is very common to encounter outcomes that can take only two possible values, such as alive or dead, win or lose, success or failure. These outcomes are called binary, Bernoulli, or 0/1 outcomes. A collection of exchangeable binary outcomes for the same covariate data is called a binomial outcome. (Outcomes are exchangeable if their order does not matter.)

  • Bernoulli/binary outcomes are used to model responses with only two possible results
    • alive vs dead
    • win vs loss
    • success vs failure
    • disease vs healthy
  • binomial outcomes = a collection of exchangeable binary outcomes (i.e. flipping a coin repeatedly) for the same covariate data
    • In other words, we are interested in the predicted number of \(1\)s vs the number of \(0\)s rather than in whether a particular outcome is \(1\) or \(0\)

Odds are very useful for building logistic regression models and are easy to interpret.

  • Imagine flipping a coin with success probability \(p\)
    • if heads, you win \(X\)
    • if tails, you lose \(Y\)
  • What should \(X\) and \(Y\) be for the game to be fair? \[E[earnings]= X p - Y (1 - p) = 0 \Rightarrow \frac{Y}{X} = \frac{p}{1 - p}\]
    • the odds can be interpreted as "How much are you willing to pay for a probability \(p\) of winning 1 dollar?"
    • if \(p > 0.5\), you have to pay more if you lose than you get if you win
    • if \(p < 0.5\), you have to pay less if you lose than you get if you win

  • odds are NOT probabilities
  • odds ratio of 1 = no difference in odds, or 50% - 50%
    • \(p = 0.5 \Rightarrow odds = \frac{0.5}{1-0.5} = 1\)
    • log odds ratio of 0 = no difference in odds
      • \(p = 0.5 \Rightarrow \log(odds) = \log\left(\frac{0.5}{1-0.5}\right) = \log(1) = 0\)
  • an odds ratio < 0.5 or > 2 is commonly called a moderate effect
  • relative risk = ratio of probabilities rather than odds; it is often easier to interpret but harder to estimate \[\frac{Pr(W_i | S_i = 10)}{Pr(W_i | S_i = 0)}\]
    • Note: relative risk often has boundary problems, since the range of \(\log(p)\) is \((-\infty,~0]\) while the range of the logit \(\log\left(\frac{p}{1-p}\right)\) is \((-\infty,\infty)\)
    • for small probabilities, relative risk \(\approx\) odds ratio, but they are not the same!
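The definitions above can be checked numerically; a minimal R sketch (the probabilities are made-up for illustration):

```r
# a probability of 0.5 gives odds of 1 and log odds of 0
p <- 0.5
odds <- p / (1 - p)       # 1
log_odds <- log(odds)     # 0

# for small probabilities the relative risk approximates the odds ratio
p1 <- 0.02; p2 <- 0.01
rel_risk   <- p1 / p2                              # 2
odds_ratio <- (p1 / (1 - p1)) / (p2 / (1 - p2))    # about 2.02, close but not equal
```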

Let us work through an example: predicting a Ravens win as a function of the number of points they score. For this we have a dataset with all of their results.

Example - Simple Linear Regression

  • simple linear regression can be used to model wins vs losses for the Ravens \[ W_i = \beta_0 + \beta_1 S_i + \epsilon_i \]
    • \(W_i\) = binary outcome, 1 if a Ravens win, 0 if not
    • \(S_i\) = number of points Ravens scored
    • \(\beta_0\) = probability of a Ravens win if they score 0 points
    • \(\beta_1\) = increase in probability of a Ravens win for each additional point
    • \(\epsilon_i\) = residual variation, error
  • The expected value for the model is defined by \[E[W_i | S_i, \beta_0, \beta_1] = \beta_0 + \beta_1 S_i\]
  • However, the model will not work well, since the predicted outcomes will not be 0 vs 1 but fractional values
    • the error term, \(\epsilon_i\), is assumed continuous and normally distributed, meaning the prediction will likely be fractional
    • therefore, this is NOT a good model assumption
# perform linear regression
# load the data
load("ravensData.rda")
head(ravensData)
##   ravenWinNum ravenWin ravenScore opponentScore
## 1           1        W         24             9
## 2           1        W         38            35
## 3           1        W         28            13
## 4           1        W         34            31
## 5           1        W         44            13
## 6           0        L         23            24
summary(lm(ravenWinNum ~ ravenScore, data = ravensData))
## 
## Call:
## lm(formula = ravenWinNum ~ ravenScore, data = ravensData)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.7302 -0.5076  0.1824  0.3215  0.5719 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 0.285032   0.256643   1.111   0.2814  
## ravenScore  0.015899   0.009059   1.755   0.0963 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4464 on 18 degrees of freedom
## Multiple R-squared:  0.1461, Adjusted R-squared:  0.09868 
## F-statistic:  3.08 on 1 and 18 DF,  p-value: 0.09625
  • as expected, the model produces a poor fit to the data (\(R^2_{adj} =\) 0.0987)
  • Adding a threshold to the generated output (i.e. if \(\hat W_i < 0.5, \hat W_i = 0\)) and using the model to predict the outcome would be workable
    • however, the coefficients of the model are not very interpretable

We turn instead to logistic regression

  • The probability of a Ravens win is defined by \[Pr(W_i | S_i, \beta_0, \beta_1)\]
  • the odds are defined as \[\frac{Pr(W_i | S_i, \beta_0, \beta_1 )}{1-Pr(W_i | S_i, \beta_0, \beta_1)}\] with range 0 to \(\infty\)
  • the log odds or logit is defined as \[\log\left(\frac{Pr(W_i | S_i, \beta_0, \beta_1 )}{1-Pr(W_i | S_i, \beta_0, \beta_1)}\right)\] ranging from \(-\infty\) to \(\infty\)
  • We can use the link function and linear predictors to build the logistic regression model \[\begin{aligned} g(\mu_i) & = \log \left(\frac{\mu_i}{1 - \mu_i} \right) = \eta_i\\ (substitute~\mu_i = Pr(W_i | S_i, \beta_0, \beta_1))~g(\mu_i) & = \log\left(\frac{Pr(W_i | S_i, \beta_0, \beta_1)}{1-Pr(W_i | S_i, \beta_0, \beta_1)}\right) = \eta_i \\ (substitute~\eta_i=\beta_0 + \beta_1 S_i) \Rightarrow ~g(\mu_i) & = \log\left(\frac{Pr(W_i | S_i, \beta_0, \beta_1)}{1-Pr(W_i | S_i, \beta_0, \beta_1)}\right) = \beta_0 + \beta_1 S_i\\ \end{aligned} \] which can also be written as \[Pr(W_i | S_i, \beta_0, \beta_1 ) = \frac{\exp(\beta_0 + \beta_1 S_i)}{1 + \exp(\beta_0 + \beta_1 S_i)}\]
  • for the model \[\log\left(\frac{Pr(W_i | S_i, \beta_0, \beta_1)}{1-Pr(W_i | S_i, \beta_0, \beta_1)}\right) = \beta_0 + \beta_1 S_i\]
    • \(\beta_0\) = log odds of a Ravens win if they score zero points
    • \(\beta_1\) = log odds ratio of win probability for each point scored (compared to zero points) \[\beta_1 = \log\left(odds(S_i = S_i+1)\right) - \log\left(odds(S_i = S_i)\right) = \log\left(\frac{odds(S_i = S_i+1)}{odds(S_i = S_i)} \right)\]
    • \(\exp(\beta_1)\) = odds ratio of win probability for each point scored (compared to zero points) \[\exp(\beta_1) = \frac{odds(S_i = S_i+1)}{odds(S_i = S_i)}\]
  • using the manipulate function we can vary \(\beta_0\) and \(\beta_1\) and see how those changes affect the logistic regression curve for simulated data
# set x values for the points to be plotted
x <- seq(-10, 10, length = 1000)
# "library(manipulate)" is needed to use the manipulate function
manipulate(
    # plot the logistic regression curve
    plot(x, exp(beta0 + beta1 * x) / (1 + exp(beta0 + beta1 * x)),
         type = "l", lwd = 3, frame = FALSE),
    # slider for beta1
    beta1 = slider(-2, 2, step = .1, initial = 2),
    # slider for beta0
    beta0 = slider(-2, 2, step = .1, initial = 0)
    )

To run the logistic regression:

# run logistic regression on data
logRegRavens <- glm(ravenWinNum ~ ravenScore, data = ravensData,family="binomial")
# print summary
summary(logRegRavens)
## 
## Call:
## glm(formula = ravenWinNum ~ ravenScore, family = "binomial", 
##     data = ravensData)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7575  -1.0999   0.5305   0.8060   1.4947  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.68001    1.55412  -1.081     0.28
## ravenScore   0.10658    0.06674   1.597     0.11
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 24.435  on 19  degrees of freedom
## Residual deviance: 20.895  on 18  degrees of freedom
## AIC: 24.895
## 
## Number of Fisher Scoring iterations: 5
  • As can be seen, the coefficients \(\beta_0\) and \(\beta_1\) are -1.68 and 0.107, interpreted as the log odds of a win at 0 points and the log odds ratio per point scored
  • We can convert the estimates as well as their confidence intervals from the log scale back to the odds scale (the same units as the data)
# take e^coef to convert the estimates to the odds scale
exp(logRegRavens$coeff)
## (Intercept)  ravenScore 
##   0.1863724   1.1124694
# exponentiate the log-scale confidence intervals to get odds-scale intervals
exp(confint(logRegRavens))
## Waiting for profiling to be done...
##                   2.5 %   97.5 %
## (Intercept) 0.005674966 3.106384
## ravenScore  0.996229662 1.303304
  • Note: \(\exp(x) \approx 1 + x\) for small values of x (near 0), which can be a quick way to estimate the exponentiated coefficients
  • We can interpret the slope, \(\beta_1\): the odds of winning increase by about 11.2% for each additional point scored
  • We can interpret the intercept, \(\beta_0\): 0.186 is the odds of a Ravens win if they score 0 points
    • Note: as with the intercept of a simple regression model, the intercept must be interpreted with care, since it is a value extrapolated from the model and may have no practical meaning
  • to compute a specific probability of winning for a given number of points \[Pr(W_i | S_i, \hat \beta_0, \hat \beta_1) = \frac{\exp(\hat \beta_0 + \hat \beta_1 S_i)}{1 + \exp(\hat \beta_0 + \hat \beta_1 S_i)}\]
  • The resulting logistic regression curve is
# plot the logistic regression
plot(ravensData$ravenScore,logRegRavens$fitted,pch=19,col="blue",xlab="Score",ylab="Prob Ravens Win")
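Using the fitted coefficients from the summary above, the win probability for a specific score follows from the inverse logit; a quick sketch (the 30-point score is an arbitrary example):

```r
# coefficients taken from the glm summary above
b0 <- -1.68001
b1 <-  0.10658
# inverse logit: predicted probability of a win for a 30-point game
eta <- b0 + b1 * 30
p30 <- exp(eta) / (1 + exp(eta))   # about 0.82
```

Equivalently, predict(logRegRavens, data.frame(ravenScore = 30), type = "response") would return the same value from the fitted model object.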

  • ANOVA can be carried out on a one-parameter logistic regression, analyzing the change in deviance as parameters are added to the model, or on multiple nested logistic regressions (similar to linear models)
# perform analysis of variance
anova(logRegRavens,test="Chisq")
## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: ravenWinNum
## 
## Terms added sequentially (first to last)
## 
## 
##            Df Deviance Resid. Df Resid. Dev Pr(>Chi)  
## NULL                          19     24.435           
## ravenScore  1   3.5398        18     20.895  0.05991 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  • ANOVA returns information about the model, the link function, and the response, as well as the analysis of deviance from adding terms
    • Df = change in degrees of freedom
      • the value 1 refers to adding the ravenScore parameter (slope)
    • Deviance = measure of goodness of model fit compare to the previous model
    • Resid. Dev = residual deviance for current model
    • Pr(>Chi) = used to evaluate the significance of the added parameter
      • in this case, the Deviance value of 3.54 is used to find the corresponding p-value from the Chi Squared distribution, which is 0.06
        • Note: the Chi-squared distribution with 1 degree of freedom is the square of a standard normal, so just as a z statistic of 2 corresponds to roughly the 95% level for the normal distribution, a deviance of 4 corresponds to approximately 5% in the Chi-squared distribution (which is what the result shows)

Ejercicios

The Baltimore Ravens are a team in the American Football League. In post season (championship) play they win about 2/3 of their games. In other words, they win about twice as often as they lose. If I wanted to bet on them, I would have to offer 2-to-1 odds–if they lost I would pay you $2, but if they won you would pay me only $1. That way, in the long run over many bets, we’d both expect to win as much money as we’d lost.

During the regular season the Ravens win about 55% of their games. What odds would I have to offer in the regular season?

1: 11 to 9 2: 1.22222 to 1 3: Any of these 4: 55 to 45

Selection: 2

Great job!


All of the answers are correct because they all represent the same ratio. If p is the probability of an event, the associated odds are p/(1-p).
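Checking that the numeric menu options express the same ratio:

```r
# 55% win probability expressed as odds
p <- 0.55
odds <- p / (1 - p)          # 1.2222... to 1
# the same ratio in other forms
same1 <- all.equal(odds, 11 / 9)
same2 <- all.equal(odds, 55 / 45)
```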


Now suppose we want to see how the Ravens’ odds depends on their offense. In other words, we want to model how p, or some function of it, depends on how many points the Ravens are able to score. Of course, we can’t observe p, we can only observe wins, losses, and the associated scores. Here is a Box plot of one season’s worth of such observations.
qownnotes-media-ylCMqU

We can see that the Ravens tend to win more when they score more points. In fact, about 3/4 of their losses are at or below a certain score and about 3/4 of their wins are at or above it. What score am I talking about? (Remember that the purple boxes represent 50% of the samples, and the “T’s” 25%.)

1: 23 2: 30 3: 40 4: 18

Selection: 1

All that practice is paying off!


There were 9 games in which the Ravens scored 23 points or less. They won 4 of these games, so we might guess their probability of winning, given that they score 23 points or less, is about 1/2.


There were 11 games in which the Ravens scored 24 points or more. They won all but one of these. Verify this by checking the data yourself. It is in a data frame called ravenData. Look at it by typing either ravenData or View(ravenData).

ravenData
  ravenWinNum ravenWin ravenScore
1           1        W          9
2           0        L         13
3           1        W         13
4           1        W         16

We see a fairly rapid transition in the Ravens’ win/loss record between 23 and 28 points. At 23 points and below they win about half their games, between 24 and 28 points they win 3 of 4, and above 28 points they win them all. From this, we get a very crude idea of the correspondence between points scored and the probability of a win. We get an S shaped curve, a graffiti S anyway.
qownnotes-media-gDqand

Of course, we would expect a real curve to be smoother. We would not, for instance, expect the Ravens to win half the games in which they scored zero points, nor to win all the games in which they scored more than 28. A generalized linear model which has these properties supposes that the log odds of a win depend linearly on the score. That is, log(p/(1-p)) = b0 + b1*score. The link function, log(p/(1-p)), is called the logit, and the process of finding the best b0 and b1, is called logistic regression.


The “best” b0 and b1 are those which maximize the likelihood of the actual win/loss record. Based on the score of a game, b0 and b1 give us a log odds, which we can convert to a probability, p, of a win. We would like p to be high for the scores of winning games, and low for the scores of losses.
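The maximization glm() performs can be made concrete by writing the binomial log likelihood directly. A sketch with made-up scores and outcomes (not the actual Ravens data):

```r
# made-up example data: scores and win/loss outcomes
score <- c(9, 13, 17, 23, 27, 34)
win   <- c(0, 0, 1, 0, 1, 1)

# log likelihood of the win/loss record for candidate coefficients b0, b1
loglik <- function(b0, b1) {
  p <- exp(b0 + b1 * score) / (1 + exp(b0 + b1 * score))  # win probabilities
  sum(win * log(p) + (1 - win) * log(1 - p))
}

# a model with an upward slope fits this record better than a flat p = 0.5 guess
loglik(-2, 0.1) > loglik(0, 0)
```

glm() searches for the b0 and b1 at which this quantity is maximal.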


We can use R’s glm() function to find the b0 and b1 which maximize the likelihood of our observations. Referring back to the data frame, we want to predict the binary outcomes, ravenWinNum, from the points scored, ravenScore. This corresponds to the formula, ravenWinNum ~ ravenScore, which is the first argument to glm. The second argument, family, describes the outcomes, which in our case are binomial. The third argument is the data, ravenData. Call glm with these parameters and store the result in a variable named mdl.

mdl <- glm(ravenWinNum ~ ravenScore, binomial, ravenData)

That’s correct!


The probabilities estimated by logistic regression using glm() are represented by the black curve. It is more reasonable than our crude estimate in several respects: It increases smoothly with score, it estimates that 15 points give the Ravens a 50% chance of winning, that 28 points give them an 80% chance, and that 55 points make a win very likely (98%) but not absolutely certain.
qownnotes-media-zBwzji

The model is less credible at scores lower than 9. Of course, there is no data in that region; the Ravens scored at least 9 points in every game. The model gives them a 33% chance of winning if they score 9 points, which may be reasonable, but it also gives them a 16% chance of winning even if they score no points! We can use R’s predict() function to see the model’s estimates for lower scores. The function will take mdl and a data frame of scores as arguments and will return log odds for the given scores. Call predict(mdl, data.frame(ravenScore=c(0, 3, 6))) and store the result in a variable called lodds.

lodds <- predict(mdl, data.frame(ravenScore=c(0, 3, 6)))

Perseverance, that’s the answer.


Since predict() gives us log odds, we will have to convert to probabilities. To convert log odds to probabilities use exp(lodds)/(1+exp(lodds)). Don’t bother to store the result in a variable. We won’t need it.

exp(lodds)/(1+exp(lodds))
        1         2         3 
0.1570943 0.2041977 0.2610505

Nice work!


As you can see, a person could make a lot of money betting against this model. When the Ravens score no points, the model might like 16 to 84 odds. As it turns out, though, the model is not that sure of itself. Typing summary(mdl) you can see the estimated coefficients are both within 2 standard errors of zero. Check out the summary now.

summary(mdl)

Call:
glm(formula = ravenWinNum ~ ravenScore, family = binomial, data = ravenData)

Deviance Residuals: 
    Min      1Q  Median      3Q     Max  
-1.7575 -1.0999  0.5305  0.8060  1.4947  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.68001    1.55412  -1.081     0.28
ravenScore   0.10658    0.06674   1.597     0.11

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 24.435  on 19  degrees of freedom
Residual deviance: 20.895  on 18  degrees of freedom
AIC: 24.895

Number of Fisher Scoring iterations: 5

That’s a job well done!


The coefficients estimate log odds as a linear function of points scored. They have a natural interpretation in terms of odds because, if b0 + b1*score estimates log odds, then exp(b0 + b1*score) = exp(b0) * exp(b1)^score estimates odds. Thus exp(b0) is the odds of winning with a score of 0 (in our case 16/84), and exp(b1) is the factor by which the odds of winning increase with every point scored. In our case exp(b1) = exp(0.10658) = 1.11. In other words, the odds of winning increase by 11% for each point scored.
However, the coefficients have relatively large standard errors. A 95% confidence interval is roughly 2 standard errors either side of a coefficient. R’s function confint() will find the exact lower and upper bounds to the 95% confidence intervals for the coefficients b0 and b1. To get the corresponding intervals for exp(b0) and exp(b1) we would just exponentiate the output of confint(mdl). Do this now.

exp(confint(mdl))
Waiting for profiling to be done...
                  2.5 %   97.5 %
(Intercept) 0.005674966 3.106384
ravenScore  0.996229662 1.303304

Nice work!


What is the 2.5% confidence bound on the odds of winning with a score of 0 points?

1: 0.005674966 2: 2.5% 3: 0.996229662

Selection: 1

That’s the answer I was looking for.


The lower confidence bound on the odds of winning with a score of 0 is near zero, which seems much more realistic than the 16/84 figure of the maximum likelihood model. Now look at the lower bound on exp(b1), the exponentiated coefficient of ravenScore. How does it suggest the odds of winning will be affected by each additional point scored?

1: They will increase by 30% 2: They will increase slightly 3: They will decrease slightly

Selection: 2

That’s not the answer I was looking for, but try again.
If you multiply a positive number by 0.996229662, do you increase or decrease the value?

1: They will decrease slightly 2: They will increase by 30% 3: They will increase slightly

Selection: 1

All that practice is paying off!


The lower confidence bound on exp(b1) suggests that the odds of winning would decrease slightly with every additional point scored. This is obviously unrealistic. Of course, confidence intervals are based on large sample assumptions and our sample consists of only 20 games. In fact, the GLM version of analysis of variance will show that if we ignore scores altogether, we don’t do much worse.

Linear regression minimizes the squared difference between predicted and actual observations, i.e., minimizes the variance of the residual. If an additional predictor significantly reduces the residual’s variance, the predictor is deemed important. Deviance extends this idea to generalized linear regression, using (negative) log likelihoods in place of variance. (For a detailed explanation, see the slides and lectures.) To see the analysis of deviance for our model, type anova(mdl).

anova(mdl)
Analysis of Deviance Table

Model: binomial, link: logit

Response: ravenWinNum

Terms added sequentially (first to last)

           Df Deviance Resid. Df Resid. Dev
NULL                          19     24.435
ravenScore  1   3.5398        18     20.895

Excellent work!


The value, 3.5398, labeled as the deviance of ravenScore, is actually the difference between the deviance of our model, which includes a slope, and that of a model which includes only an intercept, b0. This value is centrally chi-square distributed (for large samples) with 1 degree of freedom (2 parameters minus 1 parameter, or equivalently 19-18.) The null hypothesis is that the coefficient of ravenScore is zero. To confidently reject this hypothesis, we would want 3.5398 to be larger than the 95th percentile of chi-square distribution with one degree of freedom. Use qchisq(0.95, 1) to compute the threshold of this percentile.

qchisq(0.95, 1)
[1] 3.841459

You’re the best!


As you can see, 3.5398 is close to but less than the 95th percentile threshold, 3.841459, hence would be regarded as consistent with the null hypothesis at the conventional 5% level. In other words, ravenScore adds very little; a model which just guesses that the Ravens win with probability 70% (their actual record that season), or odds of 7 to 3, is almost as good. If you like, you can verify this using mdl0 <- glm(ravenWinNum ~ 1, binomial, ravenData), but this concludes the Binary Outcomes example. Thank you.
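The 70% guess can be tied back to the Null deviance reported above with a small check that needs no data beyond the win count (14 wins in 20 games is what a 70% record implies):

```r
# intercept-only model: the fitted win probability is just the overall win rate
wins <- 14; games <- 20
p_hat <- wins / games                    # 0.7
# null deviance = -2 * log likelihood under the constant-probability model
null_dev <- -2 * (wins * log(p_hat) + (games - wins) * log(1 - p_hat))
round(null_dev, 3)                       # 24.435, matching the anova table
```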

Count outcomes (Poisson)

Visits to a website tend to occur independently, one at a time, at some average rate. The Poisson distribution describes processes of this kind. A Poisson process is characterized by a single parameter, the expected rate of occurrence, usually called lambda. In our example, lambda would be the expected visits per day. Of course, if a site becomes popular, lambda will grow. In other words, lambda will depend on time. We will use Poisson regression to model that dependence.

  • the Poisson distribution is a useful model for counts and rates
    • rate = counts per unit of time
    • linear regression with a transformation is an alternative
  • examples of count data
    • number of flu cases in an area
    • number of patients coming through the emergency room door
  • rate data
    • percentage of children passing a test
    • percentage of hits to a website coming from a given country
    • radioactivity level
  • Poisson model examples
    • modeling web traffic hit rates
    • approximating binomial probabilities with small \(p\) and large \(n\)
    • analyzing contingency tables (counts tabulated by categorical variables)
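The binomial-approximation example above can be verified directly with R's distribution functions:

```r
# binomial with large n and small p is close to Poisson with lambda = n * p
n <- 1000; p <- 0.01
lambda <- n * p
pb <- dbinom(10, n, p)     # P(X = 10) under the binomial, about 0.1257
pp <- dpois(10, lambda)    # P(X = 10) under the Poisson,  about 0.1251
```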

Website visits example

For this example, the time unit is always one day, so \(t = 1\), and the Poisson mean is interpreted as clicks per day.

# load data
load("gaData.rda")
# convert the dates to proper formats
gaData$julian <- julian(gaData$date)
# plot visits vs dates
plot(gaData$julian,gaData$visits,pch=19,col="darkgrey",xlab="Julian",ylab="Visits")

LINEAR REGRESSION

Website traffic can be modeled linearly as \[ NH_i = \beta_0 + \beta_1 JD_i + \epsilon_i \]

  • \(NH_i\) = number of hits to the website
  • \(JD_i\) = day of the year (Julian day)
  • \(\beta_0\) = number of hits on Julian day 0 (1970-01-01)
  • \(\beta_1\) = increase in number of hits per unit day
  • \(\epsilon_i\) = variation due to everything we didn’t measure
  • The expected outcome is \[ E[NH_i | JD_i, \beta_0, \beta_1] = \beta_0 + \beta_1 JD_i\]

# plot the visits vs dates
plot(gaData$julian,gaData$visits,pch=19,col="darkgrey",xlab="Julian",ylab="Visits")
# perform linear regression
lm1 <- lm(gaData$visits ~ gaData$julian)
# plot regression line
abline(lm1,col="red",lwd=3)

LOG TRANSFORMATION

  • If we are interested in the relative increase in traffic, we can take the natural log of the outcome, so the linear model becomes \[ \log(NH_i) = \beta_0 + \beta_1 JD_i + \epsilon_i\]
    • \(\log(NH_i)\) = log number of hits to the website
    • \(JD_i\) = day of the year (Julian day)
    • \(\beta_0\) = log number of hits on Julian day 0 (1970-01-01)
    • \(\beta_1\) = increase in log number of hits per unit day
    • \(\epsilon_i\) = variation due to everything we didn’t measure
  • When we take the natural log of the outcome and fit a regression model, the exponentiated coefficients estimate quantities based on geometric means rather than on the measured values
    • \(e^{E[\log(Y)]}\) = geometric mean of \(Y\)
      • geometric means are defined as \[e^{\frac{1}{n}\sum_{i=1}^n \log(y_i)} = (\prod_{i=1}^n y_i)^{1/n}\] which is the estimate for the population geometric mean
      • as we collect an infinite amount of data, \((\prod_{i=1}^n y_i)^{1/n} \to e^{E[\log(Y)]}\)
    • \(e^{\beta_0}\) = estimated geometric mean hits on day 0
    • \(e^{\beta_1}\) = estimated relative increase or decrease in geometric mean hits per day
    • Note: we cannot take the natural log of zero counts, so it is often necessary to add a constant (i.e. 1) to build the log model
      • adding the constant slightly changes the interpretation of the coefficient
      • \(e^{\beta_1}\) is now the relative increase or decrease in geometric mean hits + 1 per day
  • The expected outcome is \[E[\log(NH_i) | JD_i, \beta_0, \beta_1] = \beta_0 + \beta_1 JD_i \]
round(exp(coef(lm(I(log(gaData$visits + 1)) ~ gaData$julian))), 5)
##   (Intercept) gaData$julian 
##       0.00000       1.00231

As can be seen above, clicks increase by about 0.2% per day (a factor of 1.00231 per day).
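The geometric-mean identity behind these coefficients can be verified on a toy vector:

```r
# exp of the mean log equals the n-th root of the product
y <- c(2, 8, 4)
gm1 <- exp(mean(log(y)))
gm2 <- prod(y)^(1 / length(y))
c(gm1, gm2)   # both equal 4 for this vector
```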

POISSON REGRESSION

  • the Poisson model can be defined through the log of the mean \[\log\left(E[NH_i | JD_i, \beta_0, \beta_1]\right) = \beta_0 + \beta_1 JD_i\] or equivalently \[E[NH_i | JD_i, \beta_0, \beta_1] = \exp\left(\beta_0 + \beta_1 JD_i\right)\]
    • \(NH_i\) = number of hits to the website
    • \(JD_i\) = day of the year (Julian day)
    • \(\beta_0\) = expected number of hits on Julian day 0 (1970-01-01)
    • \(\beta_1\) = expected increase in number of hits per unit day
    • Note: the Poisson model differs from the log transformation in that its coefficients are naturally interpreted on the scale of the expected outcome, while in the transformation they are interpreted on the log scale of the outcome
  • We can rewrite the Poisson model as \[E[NH_i | JD_i, \beta_0, \beta_1] = \exp\left(\beta_0 + \beta_1 JD_i\right) = \exp\left(\beta_0 \right)\exp\left(\beta_1 JD_i\right)\]
    • \(\beta_1 = \log\left(E[NH_i | JD_i+1, \beta_0, \beta_1]\right) - \log\left(E[NH_i | JD_i, \beta_0, \beta_1]\right)\)
    • \(e^{\beta_1}\) can therefore be interpreted as the relative increase or decrease in clicks for each additional day
  • glm(outcome~predictor, family = "poisson") = performs Poisson regression
# plot visits vs dates
plot(gaData$julian,gaData$visits,pch=19,col="darkgrey",xlab="Julian",ylab="Visits")
# construct Poisson regression model
glm1 <- glm(gaData$visits ~ gaData$julian,family="poisson")
# plot linear regression line in red
abline(lm1,col="red",lwd=3)
# plot Poisson regression line in blue
lines(gaData$julian,glm1$fitted,col="blue",lwd=3)


Somewhat remarkably, the variance of a Poisson process has the same value as its mean, lambda. You can quickly illustrate this by generating, say, n=1000 samples from a Poisson process using R’s rpois(n, lambda) and calculating the sample variance. For example, type var(rpois(1000, 50)). The sample variance won’t be exactly equal to the theoretical value, of course, but it will be fairly close.

var(rpois(1000, 50))
[1] 50.21732

Excellent work!


A famous theorem implies that properly normalized sums of independent, identically distributed random variables will tend to become normally distributed as the number of samples grows large. What is that theorem?

1: The Gauss-Markov BLUE Theorem 2: The Pythagorean Theorem 3: The Central Limit Theorem

Selection: 3

That’s the answer I was looking for.


The counts generated by a Poisson process are, strictly speaking, slightly different than the normalized sums of the Central Limit Theorem. However, the counts in a given period of time will represent sums of larger numbers of terms as lambda increases. In fact, it can be formally shown that for large lambda a Poisson distribution is well approximated by a normal. The figure illustrates this effect. It shows progression from a sparse, asymmetric, Poisson probability mass function on the left, to a dense, bell-shaped curve on the right as lambda varies from 2 to 100.
qownnotes-media-cqLKvL

In a Poisson regression, the log of lambda is assumed to be a linear function of the predictors. Since we will try to model the growth of visits to a web site, the log of lambda will be a linear function of the date: log(lambda) = b0 + b1*date. This implies that the average number of hits per day, lambda, is exponential in the date: lambda = exp(b0)*exp(b1)^date. Exponential growth is also suggested by the smooth, black curve drawn through the data. Thus exp(b1) would represent the percentage by which visits grow per day.
qownnotes-media-dlDSuV

Our data is in a data frame named hits. Use View(hits), head(hits), or tail(hits) to examine the data now.

head(hits)
        date visits simplystats
1 2011-01-01      0           0
2 2011-01-02      0           0
3 2011-01-03      0           0
4 2011-01-04      0           0
5 2011-01-05      0           0
6 2011-01-06      0           0

You’re the best!


There are three columns of data labeled date, visits, and simplystats respectively. The simplystats column records the number of visits which are due to references from another site, the Simply Statistics blog. We’ll come back to that column later. For now, we are interested in the date and visits columns. The date will be our predictor.
Our dates are represented in terms of R’s class, Date. Verify this by typing class(hits[,'date']), or something equivalent.

class(hits[,'date'])
[1] "Date"

Nice work!


R’s Date class represents dates as days since or prior to January 1, 1970. They are essentially numbers, and to some extent can be treated as such. Dates can, for example, be added or subtracted, or easily converted to numbers. Type as.integer(head(hits[,'date'])) to see what I mean.

as.integer(head(hits[,'date']))
[1] 14975 14976 14977 14978 14979 14980

You are doing so well!


The arithmetic properties of Dates allow us to use them as predictors. We’ll use Poisson regression to predict log(lambda) as a linear function of date in a way which maximizes the likelihood of the counts we actually see. Our formula will be visits ~ date. Since our outcomes (visits) are counts, our family will be 'poisson', and our third argument will be the data, hits. Create such a model and store it in a variable called mdl using the following expression or something equivalent, mdl <- glm(visits ~ date, poisson, hits).

mdl <- glm(visits ~ date, poisson, hits)

You got it!


The figure suggests that our Poisson regression fits the data very well. The black line is the estimated lambda, or mean number of visits per day. We see that mean visits per day increased from around 5 in early 2011 to around 10 by 2012, and to around 20 by late 2013. It is approximately doubling every year.


Type summary(mdl) to examine the estimated coefficients and their significance.

summary(mdl)

Call:
glm(formula = visits ~ date, family = poisson, data = hits)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-5.0466  -1.5908  -0.3198   0.9128  10.6545  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -3.275e+01  8.130e-01  -40.28   <2e-16 ***
date         2.293e-03  5.266e-05   43.55   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 5150.0  on 730  degrees of freedom
Residual deviance: 3121.6  on 729  degrees of freedom
AIC: 6069.6

Number of Fisher Scoring iterations: 5

Keep working like that and you’ll get there!


Both coefficients are significant, being far more than two standard errors from zero. The Residual deviance is also very significantly less than the Null, indicating a strong effect. (Recall that the difference between Null and Residual deviance is approximately chi-square with 1 degree of freedom.) The Intercept coefficient, b0, just represents log average hits on R’s Date 0, namely January 1, 1970. We will ignore it and focus on the coefficient of date, b1, since exp(b1) will estimate the percentage at which average visits increase per day of the site’s life.

Get the 95% confidence interval for exp(b1) by exponentiating confint(mdl, 'date').

exp(confint(mdl, 'date'))
Waiting for profiling to be done...
   2.5 %   97.5 % 
1.002192 1.002399 

Excellent job!

Visits are estimated to increase by a factor of between 1.002192 and 1.002399 per day. That is, between 0.2192% and 0.2399% per day. This actually represents more than a doubling every year.
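The "more than a doubling every year" claim can be verified by compounding the daily factor; a standalone check in Python, using the b1 estimate (2.293e-3) from the model summary:

```python
import math

# Daily growth factor implied by the model: exp(b1), with b1 = 2.293e-3 from summary(mdl)
daily = math.exp(2.293e-3)

# Compounded over a 365-day year
yearly = daily ** 365
print(daily, yearly)  # roughly 1.0023 per day, about 2.3x per year
```

A factor of about 2.3 per year indeed exceeds a doubling.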

Our model looks like a pretty good description of the data, but no model is perfect and we can often learn about a data generation process by looking for a model's shortcomings. As shown in the figure, one shortcoming of our model is 'zero inflation' in the first two weeks of January 2011, before the site had any visits. The model systematically overestimates the number of visits during this time. A less obvious shortcoming is that the standard deviation of the data may be increasing with lambda faster than a Poisson model allows. This possibility can be seen in the rightmost plot by visually comparing the spread of green dots with the standard deviation predicted by the model (black dashes). Also, there are four or five bursts of popularity during which the number of visits far exceeds two standard deviations over average. Perhaps these are due to mentions on another site.
qownnotes-media-RxDUZg

It seems that at least some of them are. The simplystats column of our data records the number of visits to the Leek Group site which come from the related site, Simply Statistics. (I.e., visits due to clicks on a link to the Leek Group which appeared in a Simply Statistics post.)

In the figure, the maximum number of visits occurred in late 2012. Visits from the Simply Statistics blog were also at their maximum that day. To find the exact date we can use which.max(hits[,'visits']). Do this now.

which.max(hits[,'visits'])
[1] 704

That’s a job well done!

The maximum number of visits is recorded in row 704 of our data frame. Print that row by typing hits[704,].

hits[704,]
          date visits simplystats
704 2012-12-04     94          64

Keep working like that and you’ll get there!

The maximum number of visits, 94, occurred on December 4, 2012, of which 64 came from the Simply Statistics blog. We might consider the 64 visits to be a special event, over and above normal. Can the difference, 94-64=30 visits, be attributed to normal traffic as estimated by our model? To check, we will need the value of lambda on December 4, 2012. This will be entry 704 of the fitted.values element of our model. Extract mdl$fitted.values[704] and store it in a variable named lambda.

lambda <- mdl$fitted.values[704]
lambda
 704 
24.5 

You are amazing!

The number of visits explained by our model on December 4, 2012 is that of a Poisson random variable with mean lambda. We can find the 95th percentile of this distribution using qpois(.95, lambda). Try this now.

qpois(.95, lambda)
[1] 33

You are quite good my friend!

So, 95% of the time we would see 33 or fewer visits, hence 30 visits would not be rare according to our model. It would seem that on December 4, 2012, the very high number of visits was due to references from Simply Statistics. To gauge the importance of references from Simply Statistics we may wish to model the proportion of traffic such references represent. Doing so will also illustrate the use of glm’s parameter, offset, to model frequencies and proportions.
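The same percentile can be cross-checked outside R; a minimal Python sketch that walks the Poisson CDF directly (using the rounded lambda = 24.5 from above), mimicking what qpois computes:

```python
import math

lam = 24.5  # fitted mean visits for December 4, 2012 (rounded)

# Smallest k with P(X <= k) >= 0.95, i.e. the analogue of qpois(.95, lam)
cdf, k = 0.0, 0
pmf = math.exp(-lam)      # P(X = 0)
while True:
    cdf += pmf
    if cdf >= 0.95:
        break
    k += 1
    pmf *= lam / k        # P(X = k) from P(X = k-1)
print(k)
```

This recovers the same 95th percentile of 33 that qpois reported.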

A Poisson process generates counts, and counts are whole numbers, 0, 1, 2, 3, etc. A proportion is a fraction. So how can a Poisson process model a proportion? The trick is to include the denominator of the fraction, or more precisely its log, as an offset. Recall that in our data set, 'simplystats' is the visits from Simply Statistics, and 'visits' is the total number of visits. We would like to model the fraction simplystats/visits, but to avoid division by zero we'll actually use simplystats/(visits+1). A Poisson model assumes that log(lambda) is a linear combination of predictors. Suppose we assume that log(lambda) = log(visits+1) + b0 + b1*date. In other words, if we insist that the coefficient of log(visits+1) be equal to 1, we are predicting the log of mean visits from Simply Statistics as a proportion of total visits: log(lambda/(visits+1)) = b0 + b1*date.
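Exponentiating that last identity makes the proportion interpretation explicit: the modeled proportion grows (or decays) exponentially in date.

```latex
\log\!\left(\frac{\lambda}{\text{visits}+1}\right) = b_0 + b_1\,\text{date}
\quad\Longrightarrow\quad
\frac{\lambda}{\text{visits}+1} = e^{b_0}\,\left(e^{b_1}\right)^{\text{date}}
```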

glm’s parameter, offset, has precisely this effect. It fixes the coefficient of the offset to 1. To create a model for the proportion of visits from Simply Statistics, we let offset=log(visits+1). Create such a Poisson model now and store it as a variable called mdl2.
qownnotes-media-kIgbwG

mdl2 <- glm(formula = simplystats ~ date, family = poisson, data = hits, offset = log(visits+1))

You are quite good my friend!

Although summary(mdl2) will show that the estimated coefficients are significantly different from zero, the model is actually not impressive. We can illustrate why by looking at December 4, 2012, once again. On that day there were 64 actual visits from Simply Statistics. However, according to mdl2, 64 visits would be extremely unlikely. You can verify this weakness in the model by finding mdl2's 95th percentile for that day. Recalling that December 4, 2012 was sample 704, find qpois(.95, mdl2$fitted.values[704]).

qpois(.95, mdl2$fitted.values[704])
[1] 47

Nice work!

A Poisson distribution with lambda=1000 will be well approximated by a normal distribution. What will be the variance of that normal distribution?

1: the square root of lambda
2: lambda
3: lambda squared

Selection: 3

Not quite, but you’re learning! Try again.
The mean and the variance of a Poisson distribution are equal.

1: lambda squared
2: the square root of lambda
3: lambda

Selection: 3

Perseverance, that’s the answer.
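The mean = variance property can be illustrated with a quick Monte Carlo sketch (Python, pure standard library; Poisson(1000) is simulated as a sum of 1000 Poisson(1) draws, each generated with Knuth's method — an illustrative simulation, not part of the swirl lesson):

```python
import math
import random

random.seed(7)

def poisson1():
    # Knuth's method for Poisson(1): multiply uniforms until the product drops below exp(-1)
    threshold, k, p = math.exp(-1.0), 0, 1.0
    while True:
        p *= random.random()
        if p <= threshold:
            return k
        k += 1

# By additivity, Poisson(1000) = sum of 1000 independent Poisson(1) variables
samples = [sum(poisson1() for _ in range(1000)) for _ in range(2000)]
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / (len(samples) - 1)
print(round(mean), round(var))  # both should land near 1000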

When modeling count outcomes as a Poisson process, what is modeled as a linear combination of the predictors?

1: The mean
2: The log of the mean
3: The counts

Selection: 3

Nice try, but that’s not exactly what I was hoping for. Try again.
Count outcomes and their means are never negative, but linear combinations of predictors may be.

1: The log of the mean
2: The counts
3: The mean

Selection: 1

You are amazing!

What parameter of the glm function allows you to include a predictor whose coefficient is fixed to the value 1?

1: data
2: b0
3: offset
4: family
5: formula

Selection: 3

Your dedication is inspiring!

Fitting Functions

What we are trying to do now is fit multiple linear models when the curve zigzags.

\(Y_i = f(X_i) + \epsilon_i\)

  • Consider the model \[Y_i = \beta_0 + \beta_1 X_i + \sum_{k=1}^d (x_i - \xi_k)_+ \gamma_k + \epsilon_{i}\] where \((a)_+ = a\) if \(a > 0\) and \(0\) otherwise, and \(\xi_1 \leq ... \leq \xi_d\) are known as knot points
  • The mean \[E[Y_i] = \beta_0 + \beta_1 X_i + \sum_{k=1}^d (x_i - \xi_k)_+ \gamma_k\] is continuous at the knot points
    • for \(\xi_k = 5\), the expected value of \(Y_i\) as \(x_i\) approaches 5 from the left is \[\begin{aligned} E[Y_i]_{\xi = 5 | left} & = \beta_0 + \beta_1 x_i + (x_i - 5)_+ \gamma_k \\ (since~x_i<5)~ E[Y_i]_{\xi = 5 | left} & = \beta_0 + \beta_1 x_i \\ \end{aligned}\]
    • the expected value of \(Y_i\) as \(x_i\) approaches 5 from the right is \[\begin{aligned} E[Y_i]_{\xi = 5 | right} & = \beta_0 + \beta_1 x_i + (x_i - 5)_+ \gamma_k \\ (since~x_i>5)~ E[Y_i]_{\xi = 5 | right} & = \beta_0 + \beta_1 x_i + (x_i - 5) \gamma_k\\ (simplify)~ E[Y_i]_{\xi = 5 | right} & = \beta_0 - 5 \gamma_k + (\beta_1 + \gamma_k) x_i \\ \end{aligned}\]
    • as can be seen, the right side is just another line with a different intercept (\(\beta_0 - 5 \gamma_k\)) and slope (\(\beta_1 + \gamma_k\))
    • as \(x\) approaches 5, the two sides converge
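The continuity argument above can also be checked numerically; a small sketch in Python with one knot at 5 and illustrative coefficients b0 = 1, b1 = 2, gamma = 3 (hypothetical values, not from the notes):

```python
def hinge(a):
    # (a)_+ : a if a > 0, else 0
    return a if a > 0 else 0.0

b0, b1, gamma, knot = 1.0, 2.0, 3.0, 5.0  # illustrative values

def mean_y(x):
    # E[Y] = b0 + b1*x + gamma*(x - knot)_+
    return b0 + b1 * x + gamma * hinge(x - knot)

eps = 1e-9
left = mean_y(knot - eps)    # on the line b0 + b1*x
right = mean_y(knot + eps)   # on the line (b0 - 5*gamma) + (b1 + gamma)*x
print(left, right)           # the two lines meet at the knot
```

Both one-sided values agree at the knot, confirming the fit is continuous there even though the slope changes.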

Example - Fitting Piecewise Linear Function

# simulate data
n <- 500; x <- seq(0, 4 * pi, length = n); y <- sin(x) + rnorm(n, sd = .3)
# define 20 knot points
knots <- seq(0, 8 * pi, length = 20)
# define the ()+ function to only take the values that are positive after the knot pt
splineTerms <- sapply(knots, function(knot) (x > knot) * (x - knot))
# define the predictors as X and spline term
xMat <- cbind(x, splineTerms)
# fit linear models for y vs predictors
yhat <- predict(lm(y ~ xMat))
# plot data points (x, y)
plot(x, y, frame = FALSE, pch = 21, bg = "lightblue")
# plot fitted values
lines(x, yhat, col = "red", lwd = 2)

The problem with this approach is that, while the fitted line is continuous at the knot points, it is not differentiable there. If we want it to be, we must square the knot terms, and the model becomes \[Y_i = \beta_0 + \beta_1 X_i + \beta_2 X_i^2 + \sum_{k=1}^d (x_i - \xi_k)_+^2 \gamma_k + \epsilon_{i}\] where \((a)^2_+ = a^2\) if \(a > 0\) and \(0\) otherwise.

Adding cubic terms would make it twice continuously differentiable at the knot points, etc.
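The effect of squaring the hinge terms can be seen by comparing one-sided slopes at a knot; a short Python sketch (knot at 5, step 1e-6, both chosen arbitrarily for illustration):

```python
def hinge(a):
    return a if a > 0 else 0.0      # (a)_+

def hinge_sq(a):
    return a * a if a > 0 else 0.0  # (a)_+^2

knot, eps = 5.0, 1e-6

def one_sided_slopes(f):
    left = (f(knot) - f(knot - eps)) / eps
    right = (f(knot + eps) - f(knot)) / eps
    return left, right

# plain hinge term: slope jumps from 0 to 1 at the knot (a kink)
l1, r1 = one_sided_slopes(lambda x: hinge(x - knot))
# squared hinge term: slope is ~0 on both sides (smooth through the knot)
l2, r2 = one_sided_slopes(lambda x: hinge_sq(x - knot))
print(l1, r1, l2, r2)
```

The plain hinge has a slope discontinuity (0 vs. 1), while the squared hinge's slope matches on both sides, which is exactly why the quadratic spline fit looks smooth.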

# define the knot terms in the model
splineTerms <- sapply(knots, function(knot) (x > knot) * (x - knot)^2)
# define the predictors as x, x^2 and knot terms
xMat <- cbind(x, x^2, splineTerms)
# fit linear models for y vs predictors
yhat <- predict(lm(y ~ xMat))
# plot data points (x, y)
plot(x, y, frame = FALSE, pch = 21, bg = "lightblue")
# plot fitted values
lines(x, yhat, col = "red", lwd = 2)

TEST4

qownnotes-media-lALHIT

qownnotes-media-TwmpQG

qownnotes-media-CAkQqH

qownnotes-media-EYMoTw

qownnotes-media-keCAhs

qownnotes-media-MiXJKa

Miguel Angel Huerta

October 16, 2018